Diagnostic methods

ABSTRACT

A method for analyzing a disease state of a subject includes characterizing the subject's genetic information at two or more time points or instances with a genetic analyzer, e.g., a deoxyribonucleic acid (DNA) sequencer, and using the information from the two or more time points or instances to produce an adjusted test result in the characterization of the subject's genetic information.

CROSS-REFERENCE

This application is a continuation application of International Application No. PCT/US2016/030301, filed Apr. 29, 2016, which application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Application No. 62/155,755, filed May 1, 2015, which application is incorporated herein by reference.

BACKGROUND

Cancer is a major cause of disease worldwide. Each year, tens of millions of people are diagnosed with cancer around the world, and more than half of the patients eventually die from it. In many countries, cancer ranks as the second most common cause of death, following cardiovascular diseases.

To detect cancer, several screening tests are available. A physical exam and history surveys general signs of health, including checking for signs of disease, such as lumps or other unusual physical symptoms. A history of the patient's health habits and past illnesses and treatments will also be taken. Laboratory tests are another type of screening test and may require medical procedures to procure samples of tissue, blood, urine, or other substances in the body before conducting laboratory testing. Imaging procedures screen for cancer by generating visual representations of areas inside the body. Genetic tests detect certain deleterious gene mutations linked to some types of cancer. Genetic testing is particularly useful for a number of diagnostic methods.

One approach for cancer screening may include the monitoring of a sample derived from cell free nucleic acids, a population of polynucleotides that can be found in different types of bodily fluids. In some cases, disease may be characterized or detected based on detection of genetic variations, such as a change in copy number variation and/or sequence variation of one or more nucleic acid sequences, or the development of certain other rare genetic alterations. Cell free DNA (“cfDNA”) may contain genetic variations associated with a particular disease. With improvements in sequencing and techniques to manipulate nucleic acids, there is a need in the art for improved methods and systems for using cell free DNA to detect and monitor disease.

SUMMARY

In an aspect, the present disclosure provides a method for analyzing a disease state of a subject, comprising (a) using a genetic analyzer to generate genetic data from nucleic acid molecules in biological samples of the subject obtained at (i) two or more time points or (ii) substantially the same time point, wherein the genetic data relates to genetic information of the subject, and wherein the biological samples include a cell-free biological sample; (b) receiving the genetic data from the genetic analyzer; (c) with one or more programmed computer processors, using the genetic data to produce an adjusted test result in a characterization of the genetic information of the subject; and (d) outputting the adjusted test result into computer memory.

In some embodiments, the genetic data comprises current sequence reads and prior sequence reads, and wherein (c) comprises comparing the current sequence reads with the prior sequence reads and updating a diagnostic confidence indication accordingly with respect to the characterization of the genetic information of the subject, which diagnostic confidence indication is indicative of a probability of identifying one or more genetic variations in a biological sample of the subject.

In some embodiments, the method further comprises generating a confidence interval for the current sequence reads. In some embodiments, the method further comprises comparing the confidence interval with one or more prior confidence intervals and determining a disease progression based on overlapping confidence intervals.

In some embodiments, the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises increasing a diagnostic confidence indication in a subsequent or a previous characterization if the information from the first time point corroborates information from the second time point. In some embodiments, the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises increasing a diagnostic confidence indication in the subsequent characterization if the information from the first time point corroborates information from the second time point.

In some embodiments, a first co-variate variation is detected in the genetic data, and wherein (c) comprises increasing a diagnostic confidence indication in the subsequent characterization if a second co-variate variation is detected.

In some embodiments, the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises decreasing a diagnostic confidence indication in the subsequent characterization if the information from a first time point conflicts with information from the second time point.

In some embodiments, the method further comprises obtaining a subsequent characterization and leaving as is a diagnostic confidence indication in the subsequent characterization for de novo information. In some embodiments, the method further comprises determining a frequency of one or more genetic variants detected in a collection of sequence reads included in the genetic data and producing the adjusted test result at least in part by comparing the frequency of the one or more genetic variants at the two or more time points. In some embodiments, the method further comprises determining an amount of copy number variation at one or more genetic loci detected in a collection of sequence reads included in the genetic data and producing the adjusted test result at least in part by comparing the amount at the two or more time points. In some embodiments, the method further comprises using the adjusted test result to provide (i) a therapeutic intervention or (ii) a diagnosis of a health or disease to the subject.

In some embodiments, the genetic data comprises sequence data from portions of a genome comprising disease-associated or cancer-associated genetic variants.

In some embodiments, the method further comprises using the adjusted test result to increase a sensitivity of detecting genetic variants by increasing read depth of polynucleotides in a sample from the subject.

In some embodiments, the genetic data comprises a first set of genetic data and a second set of genetic data, wherein the first set of genetic data is at or below a detection threshold and the second set of genetic data is above the detection threshold. In some embodiments, the detection threshold is a noise threshold. In some embodiments, the method further comprises, in (c), adjusting a diagnosis of the subject from negative or uncertain to positive when the same genetic variants are detected in the first set of genetic data and the second set of genetic data in a plurality of sampling instances or time points. In some embodiments, the method further comprises, in (c), adjusting a diagnosis of the subject from negative or uncertain to positive in a characterization from an earlier time point when the same genetic variants are detected in the first set of genetic data at an earlier time point and in the second set of genetic data at a later time point.

In some embodiments, the disease state is cancer and the genetic analyzer is a nucleic acid sequencer.

In some embodiments, the biological samples include at least two different types of biological samples. In some embodiments, the biological samples include the same type of biological sample. In some embodiments, the biological samples are blood samples. In some embodiments, the nucleic acid molecules are cell-free deoxyribonucleic acid (DNA).

In another aspect, the present disclosure provides a method of detecting a trend in the amount of cancer polynucleotides in a biological sample from a subject over time, comprising determining, using one or more programmed computer processors, a frequency of the cancer polynucleotides at each of a plurality of time points; determining an error range for the frequency at each of the plurality of time points to provide at least a first error range at a first time point and a second error range at a second time point subsequent to the first time point; and determining whether (1) the first error range overlaps with the second error range, which overlap is indicative of stability of frequency of the cancer polynucleotides at a plurality of time points, (2) the second error range is greater than the first error range, thereby indicating an increase in frequency of the cancer polynucleotides at a plurality of time points, or (3) the second error range is less than the first error range, thereby indicating a decrease in frequency of the cancer polynucleotides at a plurality of time points.
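By way of non-limiting illustration, the comparison of error ranges at two time points may be sketched in Python as follows. The names `ErrorRange`, `frequency_error_range`, and `classify_trend`, the normal-approximation error range, and the z value of 1.96 are assumptions made for this example and are not specified by the disclosure. A second error range is treated here as "greater than" or "less than" the first when it lies entirely above or entirely below it.

```python
import math
from typing import NamedTuple


class ErrorRange(NamedTuple):
    low: float
    high: float


def frequency_error_range(variant_reads: int, total_reads: int,
                          z: float = 1.96) -> ErrorRange:
    """Hypothetical helper: normal-approximation range around an observed frequency."""
    p = variant_reads / total_reads
    half_width = z * math.sqrt(p * (1.0 - p) / total_reads)
    # Clamp to [0, 1] because a frequency cannot fall outside that interval.
    return ErrorRange(max(0.0, p - half_width), min(1.0, p + half_width))


def classify_trend(first: ErrorRange, second: ErrorRange) -> str:
    """Classify the change between an earlier and a later error range."""
    if second.low > first.high:
        return "increase"   # later range lies entirely above the earlier one
    if second.high < first.low:
        return "decrease"   # later range lies entirely below the earlier one
    return "stable"         # the two ranges overlap


# Illustrative read counts only; these are not data from the disclosure.
earlier = frequency_error_range(variant_reads=10, total_reads=1000)
later = frequency_error_range(variant_reads=50, total_reads=1000)
trend = classify_trend(earlier, later)
```

Other interval constructions (for example, an exact binomial interval) could be substituted for the hypothetical helper without changing the comparison step.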

In some embodiments, the cancer polynucleotides are deoxyribonucleic acid (DNA) molecules. In some embodiments, the DNA is cell-free DNA.

In some embodiments, the frequency at each of the plurality of time points is determined by sequencing nucleic acid molecules in biological samples of the subject. In some embodiments, the biological samples are blood samples. In some embodiments, the nucleic acid molecules are cell-free deoxyribonucleic acid (DNA).

In another aspect, the present disclosure provides a method to detect one or more genetic variations and/or amount of genetic variation in a subject, comprising sequencing nucleic acid molecules in a cell-free nucleic acid sample of the subject with a genetic analyzer to generate a first set of sequence reads at a first time point; comparing the first set of sequence reads with at least a second set of sequence reads obtained at least at a second time point before the first time point to yield a comparison of the first set of sequence reads and the at least the second set of sequence reads; using the comparison to update a diagnostic confidence indication accordingly, which diagnostic confidence indication is indicative of a probability of identifying one or more genetic variations in a cell-free nucleic acid sample of the subject; and detecting a presence or absence of the one or more genetic variations and/or amount of genetic variation in nucleic acid molecules in a cell-free nucleic acid sample of the subject based on the diagnostic confidence indication.

In some embodiments, the method further comprises obtaining the cell-free nucleic acid molecules from the subject.

In some embodiments, the method further comprises sequencing additional cell-free nucleic acid molecules of the subject to generate a third set of sequence reads at a third time point subsequent to the first time point, and detecting a presence or absence of the one or more genetic variations and/or amount of genetic variation in the additional cell-free nucleic acid molecules of the subject based on the diagnostic confidence indication.

In some embodiments, the method further comprises increasing the diagnostic confidence indication if information obtained from the first set of sequence reads at the first time point corroborates information obtained from the at least the second set of sequence reads at the second time point.

In some embodiments, the method further comprises decreasing the diagnostic confidence indication if information obtained from the first set of sequence reads at the first time point does not corroborate or conflicts with information obtained from the at least the second set of sequence reads at the second time point. In some embodiments, the method further comprises leaving as is the diagnostic confidence indication in a subsequent characterization for de novo information.
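By way of non-limiting illustration, the three update rules described above (increase on corroboration, decrease on conflict, leave as is for de novo information) may be sketched in Python as follows. The function name `update_confidence`, the representation of prior results as a mapping from a variant label to whether it was previously observed, and the fixed step size of 0.1 are all assumptions made for this example; the disclosure does not specify how the diagnostic confidence indication is computed.

```python
from typing import Mapping


def update_confidence(confidence: float, variant: str,
                      prior_calls: Mapping[str, bool],
                      step: float = 0.1) -> float:
    """Adjust a confidence value in [0, 1] for a variant seen in the current reads.

    prior_calls maps a variant label to True if the variant was observed at a
    prior time point, or False if that locus was examined and the variant was
    absent. A label missing from prior_calls is treated as de novo information.
    """
    if variant not in prior_calls:
        return confidence                      # de novo: leave as is
    if prior_calls[variant]:
        return min(1.0, confidence + step)     # corroborated: increase
    return max(0.0, confidence - step)         # conflicting: decrease


# Hypothetical variant labels, for illustration only.
prior = {"variant_A": True, "variant_B": False}
corroborated = update_confidence(0.5, "variant_A", prior)
conflicting = update_confidence(0.5, "variant_B", prior)
de_novo = update_confidence(0.5, "variant_C", prior)
```

A fixed additive step is used only to keep the sketch short; a probabilistic update could be substituted in the same three-branch structure.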

In another aspect, the present disclosure provides a method for detecting a mutation in a cell-free nucleic acid sample of a subject, comprising: (a) determining consensus sequences by comparing current sequence reads obtained from a genetic analyzer with prior sequence reads from a prior time period to yield a comparison, and updating a diagnostic confidence indication based on the comparison, wherein each consensus sequence corresponds to a unique polynucleotide among a set of tagged parent polynucleotides derived from the cell-free nucleic acid sample, and (b) based on the diagnostic confidence, generating a genetic profile of extracellular polynucleotides in the subject, wherein the genetic profile comprises data resulting from copy number variation or mutation analyses.

In some embodiments, the method further comprises, prior to (a), providing a plurality of sets of tagged parent polynucleotides derived from the cell-free nucleic acid sample, wherein each set is mappable to a different reference sequence.

In some embodiments, the method further comprises: using the consensus sequences to normalize ratios or frequency of variance for each mappable base position and determining actual or potential rare variant(s) or mutation(s); and comparing a resulting number for each region with potential rare variant(s) or mutation(s) to similarly derived numbers from a reference sample.

In another aspect, the present disclosure provides a method to detect abnormal cellular activity, comprising: providing at least one set of tagged parent polynucleotides derived from a biological sample of a subject; amplifying the tagged parent polynucleotides in the set to produce a corresponding set of amplified progeny polynucleotides; using a genetic analyzer to sequence a subset of the set of amplified progeny polynucleotides to produce a set of sequencing reads; and collapsing the set of sequencing reads to generate a set of consensus sequences by comparing current sequence reads with prior sequence reads from at least one prior time period and updating a diagnostic confidence indication accordingly, which diagnostic confidence indication is indicative of a probability of identifying one or more genetic variations in a biological sample of the subject, wherein each consensus sequence corresponds to a unique polynucleotide among the set of tagged parent polynucleotides.

In some embodiments, the method further comprises increasing the diagnostic confidence indication if the set of sequencing reads is identified in the at least one prior time period. In some embodiments, the method further comprises decreasing the diagnostic confidence indication if the set of sequencing reads is not identified in the at least one prior time period. In some embodiments, the method further comprises keeping the diagnostic confidence indication unchanged if the set of sequencing reads is identified in the at least one prior time period but is inconclusive.

In some embodiments, the set of sequencing reads comprises at least one sequencing read.

In some embodiments, the biological sample is a blood sample. In some embodiments, the biological sample comprises cell-free nucleic acid molecules, and the at least one set of tagged parent polynucleotides is generated from the cell-free nucleic acid molecules.

In some embodiments, the method further comprises generating a genetic profile of polynucleotides of the subject, which genetic profile includes an analysis of one or more genetic variants of the subject. In some embodiments, the polynucleotides include extracellular polynucleotides.

In another aspect, the present disclosure provides a method for detecting a mutation in a cell-free or substantially cell free sample of a subject comprising: (a) sequencing extracellular polynucleotides from a bodily sample of the subject with a genetic analyzer; (b) for each of the extracellular polynucleotides, generating a plurality of sequencing reads; (c) filtering out reads that fail to meet a set threshold; (d) mapping sequence reads derived from the sequencing onto a reference sequence; (e) identifying a subset of mapped sequence reads that align with a variant of the reference sequence at each mappable base position; (f) for each mappable base position, calculating a ratio of (i) a number of mapped sequence reads that include a variant as compared to the reference sequence, to (ii) a number of total sequence reads for each mappable base position; and (g) using one or more programmed computer processors to compare the sequence reads with other sequence reads from at least one previous time point and updating a diagnostic confidence indication accordingly, which diagnostic confidence indication is indicative of a probability of identifying the variant.
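By way of non-limiting illustration, steps (c) through (f) above may be sketched in Python as follows. The sketch assumes that reads have already been mapped, and represents each mapped read as a hypothetical tuple of (start position on the reference, bases, mean quality score); the function name `variant_ratios`, the quality-based filter, and the threshold of 20 are assumptions made for this example and are not taken from the disclosure. Insertions and deletions are ignored for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

MappedRead = Tuple[int, str, float]  # (start position, bases, mean quality)


def variant_ratios(reference: str, mapped_reads: Iterable[MappedRead],
                   min_quality: float = 20.0) -> Dict[int, float]:
    """Return, per covered base position, variant reads / total reads."""
    variant_counts: Dict[int, int] = defaultdict(int)
    total_counts: Dict[int, int] = defaultdict(int)
    for start, bases, quality in mapped_reads:
        if quality < min_quality:
            continue  # step (c): drop reads that fail the set threshold
        for offset, base in enumerate(bases):
            position = start + offset
            if position >= len(reference):
                break  # read runs past the end of the reference
            total_counts[position] += 1
            if base != reference[position]:
                variant_counts[position] += 1  # step (e): read differs here
    # step (f): ratio of variant-bearing reads to total reads at each position
    return {pos: variant_counts[pos] / total_counts[pos] for pos in total_counts}


# Toy reference and reads, for illustration only.
ratios = variant_ratios(
    "ACGT",
    [(0, "ACGT", 30.0), (0, "ACTT", 30.0), (1, "CGT", 10.0)],
)
```

In this toy input the third read is filtered out by the quality threshold, so each position is covered by two reads and only position 2 shows a variant.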

In some embodiments, the bodily sample is a blood sample. In some embodiments, the extracellular polynucleotides include cell-free deoxyribonucleic acid (DNA) molecules.

In another aspect, the present disclosure provides a method for operating genetic test equipment, comprising: providing initial starting genetic material obtained from a bodily sample obtained from a subject; converting double stranded polynucleotide molecules from the initial starting genetic material into at least one set of non-uniquely tagged parent polynucleotides, wherein each polynucleotide in a set is mappable to a reference sequence; and for each set of tagged parent polynucleotides: (i) amplifying the tagged parent polynucleotides in the set to produce a corresponding set of amplified progeny polynucleotides; (ii) sequencing the set of amplified progeny polynucleotides to produce a set of sequencing reads; (iii) collapsing the set of sequencing reads to generate a set of consensus sequences, wherein collapsing uses sequence information from a tag and at least one of: (1) sequence information at a beginning region of a sequence read, (2) an end region of the sequence read and (3) length of the sequence read, wherein each consensus sequence of the set of consensus sequences corresponds to a polynucleotide molecule among the set of tagged parent polynucleotides; (iv) analyzing the set of consensus sequences for each set of tagged parent molecules; (v) comparing current sequence reads with prior sequence reads from at least one other time point; and (vi) updating a diagnostic confidence indication accordingly, which diagnostic confidence indication is indicative of a probability of identifying one or more genetic variations in a bodily sample of the subject.

In some embodiments, the bodily sample is a blood sample. In some embodiments, the initial starting genetic material includes cell-free deoxyribonucleic acid (DNA).

In some embodiments, the set of consensus sequences for each set of tagged parent molecules is analyzed separately.

In some embodiments, analyzing comprises detecting mutations, indels, copy number variations, transversions, translocations, inversions, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplification, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, abnormal changes in nucleic acid methylation, infection, or cancer.

In some embodiments, (vi) comprises increasing a diagnostic confidence indication in the current sequence reads if information from the prior sequence reads corroborates information from the current sequence reads. In some embodiments, (vi) comprises decreasing a diagnostic confidence indication in the current sequence reads if information from the prior sequence reads conflicts with information from the current sequence reads. In some embodiments, (vi) comprises keeping a diagnostic confidence indication the same in the current sequence reads if information from the prior sequence reads is inconclusive with respect to information from the current sequence reads.

In some embodiments, (v) comprises comparing one or more current sequence read variations with one or more prior sequence read variations.

In another aspect, the present disclosure provides a method for detecting one or more genetic variants in a subject, comprising: (a) obtaining nucleic acid molecules from one or more cell-free biological samples of said subject; (b) assaying said nucleic acid molecules to produce a first set of genetic data and a second set of genetic data, wherein said first set of genetic data and/or said second set of genetic data is within a detection threshold; (c) comparing said first set of genetic data to said second set of genetic data to identify said one or more genetic variants in said first set of genetic data or said second set of genetic data; and (d) based on said one or more genetic variants identified in (c), using one or more programmed computer processors to update a diagnostic confidence indication for identifying said one or more genetic variants in a cell-free biological sample of said subject.

In some embodiments, said first set of genetic data and said second set of genetic data are within said detection threshold. In some embodiments, said first set of genetic data is within said detection threshold and said second set of genetic data is above said detection threshold. In some embodiments, said detection threshold is a noise threshold.

In some embodiments, the method further comprises identifying said one or more genetic variants in said first set of genetic data, and increasing said diagnostic confidence indication.

In some embodiments, subsets of said nucleic acid molecules are assayed at different time points. In some embodiments, said nucleic acid molecules are obtained from a plurality of cell-free biological samples at the same time point or different time points.

In some embodiments, said nucleic acid molecules are deoxyribonucleic acid (DNA). In some embodiments, said DNA is cell-free DNA (cfDNA).

In some embodiments, the method further comprises generating a genetic profile for said subject, wherein said genetic profile comprises said diagnostic confidence indication for identifying said one or more genetic variants.

In some embodiments, a co-variate variant is identified in said first set of genetic data in (c), and the method further comprises updating said diagnostic confidence indication for identifying a second co-variate variant in a cell-free biological sample of said subject. In some embodiments, the method further comprises increasing said diagnostic confidence indication in (c) if said first set of genetic data is observed in said second set of genetic data. In some embodiments, the method further comprises decreasing said diagnostic confidence indication in (c) if said first set of genetic data differs from said second set of genetic data.

In some embodiments, said detection threshold comprises errors introduced by sequencing or amplification.

In some embodiments, said detection threshold comprises a per-base error rate of 0.5% to 5%. In some embodiments, said detection threshold comprises a per-base error rate of 0.5% to 1%.

In some embodiments, said nucleic acid molecules are obtained from a second cell-free biological sample of said subject. In some embodiments, said second cell-free biological sample is obtained after obtaining said cell-free biological sample of (a). In some embodiments, said second cell-free biological sample is obtained prior to obtaining said cell-free biological sample of (a). In some embodiments, said second cell-free biological sample is obtained concurrent with obtaining said cell-free biological sample of (a). In some embodiments, said first set of genetic data corresponds to said cell-free biological sample of (a) and said second set of genetic data corresponds to said second cell-free biological sample.

In some embodiments, the method further comprises: attaching tags to said nucleic acid molecules to generate tagged parent polynucleotides; amplifying said tagged parent polynucleotides to produce tagged progeny polynucleotides; and sequencing said tagged progeny polynucleotides to produce sequencing reads.

In some embodiments, the attaching comprises uniquely tagging the nucleic acid molecules. In some embodiments, the attaching comprises non-uniquely tagging said nucleic acid molecules such that no more than 5% of said nucleic acid molecules are uniquely tagged.

In some embodiments, the method further comprises selectively enriching sequences of interest prior to the sequencing.

In some embodiments, the method further comprises grouping said sequence reads into families based at least on a sequence tag. In some embodiments, grouping the sequence reads is further based on one or more of: sequence information at a beginning of a sequence read derived from the nucleic acid molecule, sequence information at an end of said sequence read derived from the nucleic acid molecule, and a length of said sequence read.

In some embodiments, the method further comprises comparing the sequence reads grouped within each family to determine consensus sequences for each family, wherein each of the consensus sequences corresponds to a unique polynucleotide among the tagged parent polynucleotides.
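By way of non-limiting illustration, grouping sequence reads into families and determining a consensus sequence for each family may be sketched in Python as follows. The sketch assumes each read is a hypothetical tuple of (tag, mapped start position, sequence), keys each family on the tag together with the start position and read length, and uses a per-position majority vote for the consensus; the function name `consensus_by_family` and the majority-vote rule are assumptions made for this example, since the disclosure does not prescribe a particular consensus rule.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, int, str]          # (tag, mapped start position, sequence)
FamilyKey = Tuple[str, int, int]     # (tag, start position, read length)


def consensus_by_family(reads: Iterable[Read]) -> Dict[FamilyKey, str]:
    """Group reads into families and collapse each family to one sequence."""
    families: Dict[FamilyKey, List[str]] = defaultdict(list)
    for tag, start, sequence in reads:
        families[(tag, start, len(sequence))].append(sequence)

    consensus: Dict[FamilyKey, str] = {}
    for key, members in families.items():
        # All members share a length (it is part of the key), so zip(*members)
        # yields one column of bases per position; keep the most common base.
        consensus[key] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*members)
        )
    return consensus


# Toy reads, for illustration only. The third read carries a single
# mismatch that is outvoted by the other two members of its family.
result = consensus_by_family([
    ("AAT", 100, "ACGT"),
    ("AAT", 100, "ACGT"),
    ("AAT", 100, "ACCT"),
    ("GGC", 100, "ACGT"),
])
```

Including start position and length in the family key is what allows non-unique tags to be used: two parent molecules that happen to share a tag are still separated when they map to different positions or have different lengths.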

In some embodiments, the method further comprises obtaining less than 100 ng of the nucleic acid molecules.

In another aspect, the present disclosure provides a method for calling a genetic variant in cell-free deoxyribonucleic acid (cfDNA) from a subject comprising: (a) using a DNA sequencing system to sequence cfDNA from a sample taken at a first time point from a subject; (b) detecting a genetic variant in the sequenced cfDNA from the first time point, wherein the genetic variant is detected at a level below a diagnostic limit; (c) using the DNA sequencing system to sequence cfDNA from a sample taken from the subject at one or more subsequent time points; (d) detecting the genetic variant in the sequenced cfDNA from the one or more subsequent time points, wherein the genetic variant is detected at a level below the diagnostic limit; and (e) calling the samples as positive for the genetic variant based on detecting the genetic variant below the diagnostic limit in samples taken at a plurality of the time points.

In some embodiments, the method further comprises (f) detecting a trend, wherein, at the first time point, the genetic variant is detected below the diagnostic limit and called as positive, and, at one or more subsequent time points, the genetic variant is detected above the diagnostic limit whereby the genetic variant is increasing.

In some embodiments, the diagnostic limit is less than or equal to about 1.0%.
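By way of non-limiting illustration, calling a variant as positive from repeated detections below the diagnostic limit, and detecting the trend of step (f), may be sketched in Python as follows. The sketch assumes one observed variant frequency per time point in chronological order, with 0.0 meaning the variant was not detected; the function names, the requirement of at least two detections, and the additional rule that a single observation above the limit is also called positive are assumptions made for this example. The default limit of 0.01 corresponds to the 1.0% embodiment above.

```python
from typing import Sequence


def call_variant(frequencies: Sequence[float],
                 diagnostic_limit: float = 0.01,
                 min_time_points: int = 2) -> bool:
    """Return True if the variant is called positive across the time points."""
    detected = [f for f in frequencies if f > 0.0]
    if any(f > diagnostic_limit for f in detected):
        return True  # assumed rule: a detection above the limit is positive
    # Sub-limit detections at a plurality of time points support a positive call.
    return len(detected) >= min_time_points


def is_increasing(frequencies: Sequence[float],
                  diagnostic_limit: float = 0.01) -> bool:
    """Step (f): detected below the limit first, above the limit later."""
    first_below = 0.0 < frequencies[0] <= diagnostic_limit
    later_above = any(f > diagnostic_limit for f in frequencies[1:])
    return first_below and later_above


# Illustrative frequencies only; these are not data from the disclosure.
repeated_low = call_variant([0.002, 0.003])
single_low = call_variant([0.002, 0.0])
rising = is_increasing([0.004, 0.02])
```

The same structure applies to the re-sequencing embodiment, with each list element representing one sequencing run of the same sample rather than one time point.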

In another aspect, the present disclosure provides a method for calling a genetic variant in cell-free deoxyribonucleic acid (cfDNA) from a subject comprising: (a) using a deoxyribonucleic acid (DNA) sequencing system to sequence cfDNA from a sample from a subject; (b) detecting a genetic variant in the sequenced cfDNA, wherein the genetic variant is detected at a level below a diagnostic limit; (c) using the DNA sequencing system to sequence cfDNA from the sample taken from the subject, wherein the sample is re-sequenced one or more times; (d) detecting the genetic variant in the sequenced cfDNA from the one or more re-sequenced samples, wherein the genetic variant is detected at a level below the diagnostic limit; and (e) calling the samples as positive for the genetic variant based on detecting the genetic variant below the diagnostic limit in re-sequenced samples.

In another aspect, the present disclosure provides a non-transitory computer readable medium comprising machine-executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.

In another aspect, the present disclosure provides a computer system comprising one or more computer processors and memory coupled thereto. The memory comprises a non-transitory computer readable medium comprising machine-executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.

Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.

INCORPORATION BY REFERENCE

All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

The novel features of the disclosure are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present disclosure will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the disclosure are utilized, and the accompanying drawings of which:

FIGS. 1A-1D illustrate exemplary systems to reduce error rates and bias in DNA sequence readings.

FIG. 2 illustrates an exemplary process for analyzing polynucleotides in a sample of initial genetic material.

FIG. 3 illustrates another exemplary process for analyzing polynucleotides in a sample of initial genetic material.

FIG. 4 illustrates another exemplary process for analyzing polynucleotides in a sample of initial genetic material.

FIGS. 5A and 5B show schematic representations of internet enabled access of reports generated from copy number variation analysis of a subject with cancer.

FIG. 6 shows a schematic representation of internet enabled access of reports of a subject with cancer.

FIG. 7 illustrates a computer system programmed or otherwise configured to analyze genetic data.

FIG. 8 shows detection of sequences in a sample spiked with nucleic acids bearing cancer mutants.

FIG. 9 shows a gene panel that may be used with methods and systems of the present disclosure.

DETAILED DESCRIPTION

While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, to the extent that the terms “including”, “includes”, “having”, “has”, “with”, or variants thereof are used in either the detailed description and/or the claims, such terms are intended to be inclusive in a manner similar to the term “comprising”.

The term “about” or “approximately” means within an acceptable error range for the particular value as determined by one of ordinary skill in the art, which will depend in part on how the value is measured or determined, i.e., the limitations of the measurement system. For example, “about” can mean within 1 or more than 1 standard deviation, per the practice in the art. Alternatively, “about” can mean a range of up to 20%, up to 10%, up to 5%, or up to 1% of a given value. Alternatively, particularly with respect to biological systems or processes, the term can mean within an order of magnitude, such as within 5-fold or within 2-fold, of a value. Where particular values are described in the application and claims, unless otherwise stated the term “about” meaning within an acceptable error range for the particular value should be assumed.

In certain embodiments, diagnostics involve detecting (e.g., measuring) a signal indicative of disease, such as a biomarker, and correlating the detection or measurement with a disease state. However, a signal may be weak due to low sample concentration or it may be obscured by noise. If the signal is weak such that it is at or below a noise threshold or detection limit, it may be difficult to differentiate signal from noise produced by the detection system or detect the signal at all. In such cases, one may not be confident in making a diagnosis. By looking at genetic data or detected variations from a plurality of points in time, a plurality of tests as confirmatory signals, or a plurality of commonly detected co-variate genetic variants, the diagnostic confidence can be enhanced.

The terms “detection limit” and “diagnostic limit”, as used herein, generally refer to the capability to detect the presence or absence, or amount, of a given gene or variant at a predetermined level of confidence. A detection threshold as generally used herein refers to a range at or below the detection limit where certain genetic variants are undetectable or may not be differentiated from noise. In some instances, a “detection limit” may be the lowest frequency or concentration at which a variant is detected in a variant-positive sample 95% of the time. A diagnostic limit may be the lowest frequency at which a positive call can be made. A diagnostic limit may be from about 0.01% to about 1%. A diagnostic limit may be less than or equal to about 5%, about 1.0%, about 0.8%, about 0.5%, about 0.25%, about 0.1%, about 0.08%, about 0.05%, about 0.03%, about 0.01%, or less. In some instances, the detection limit may be the same as the diagnostic limit. The detection limit or diagnostic limit may be a noise limit or noise threshold. In such a scenario, the detection limit or diagnostic limit is the limit at which signal may not be differentiated from noise.

In some instances, the diagnostic limit may be lower than the detection limit. Using methods and systems described herein, a genetic variant(s) present in an amount at or below the detection limit may be positively called at a predetermined level of confidence (e.g., at least 80%, 90%, or 95% confidence).

So, for example, sequence analysis of a sample may reveal a number of different genetic variants at a variety of frequencies or concentrations in the sample. The diagnostic limit may be set by a clinician at, for example, 1%, which is to say, no variant is to be reported as “present” in the sample, or “called” in a report, unless the variant is present at a concentration of at least 1%. If a first variant is detected at 5%, that variant is “called” present in the sample and reported. Another variant is detected at 0.5%. This is below the diagnostic limit, and may be below the detection limit of the sequencing system. In this case, the clinician has several options. First, the same sample may be re-tested. If the variant is again detected, below or above the detection limit, it is now “called” as present in the sample. Second, the sequence data can be examined for the presence of a co-variate variation. For example, the variant may be a known resistance mutation. If a driver mutation is detected in the same gene from the sequence data, this also indicates that the resistance mutant is likely not to be a “noise” detection and, again, a positive call can be made. Third, the subject can be tested again at a later time point. If the variant is detected in the later sample, the first sample can be called as “present” for the variant. Alternatively, if a subsequent test shows an amount of the variant with a confidence score that does not overlap with the first test, the variant can be called as increasing or decreasing in the subject, as the case may be.
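The three options in the preceding example can be sketched as a small decision routine. This is only an illustrative sketch, not the disclosed implementation: the names `Observation` and `call_variant`, the order in which the options are checked, and the rule that any nonzero re-detection counts as confirmation are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Observation:
    """One detection of a variant (hypothetical structure for illustration)."""
    variant: str
    frequency: float  # fraction of molecules, e.g. 0.005 for 0.5%


def call_variant(
    first: Observation,
    diagnostic_limit: float,
    retest: Optional[Observation] = None,
    covariate_detected: bool = False,
    later: Optional[Observation] = None,
) -> Tuple[bool, str]:
    """Return (called, reason) following the three options in the text."""
    # A variant at or above the diagnostic limit is called directly.
    if first.frequency >= diagnostic_limit:
        return True, "above diagnostic limit"
    # Option 1: the same sample is re-tested and the variant is seen again.
    if retest is not None and retest.frequency > 0:
        return True, "confirmed on re-test"
    # Option 2: a co-variate variant (e.g. a driver mutation) is detected.
    if covariate_detected:
        return True, "supported by co-variate variant"
    # Option 3: the variant is seen in a sample from a later time point.
    if later is not None and later.frequency > 0:
        return True, "confirmed at later time point"
    return False, "below diagnostic limit, unconfirmed"
```

With a 1% diagnostic limit, a variant at 5% is called directly, while a variant at 0.5% is called only if one of the three confirmations is available.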

Several factors may affect the ability to detect genes or variants at or near the detection or diagnostic limit. Genes or variants may be present at such low amounts or concentrations that a sequence analyzer cannot detect them. For example, out of one million analyzed cell-free nucleic acid molecules, a genetic mutation may be present in one analyzed cell-free nucleic acid molecule, thus the variant base call exists at a frequency of one in a million. A sequencing analyzer may mischaracterize the genetic mutation as a non-variant base call because the genetic mutation occurs with a low frequency relative to all other base calls at the same site. In such instances, a detection limit may generally refer to the ability of a genetic analyzer or sequencer to detect genetic variations present at very low frequencies. Additionally, sequence errors or artifacts introduced from sequencing or amplification can make it difficult or impossible to differentiate between errors and/or artifacts and detected genes or genetic variations. In such instances, a detection limit may refer to the ability to distinguish between variant base calls and error calls with confidence. The present disclosure provides technique(s) for detecting genetic variations at or below the detection limit and/or within a detection threshold.

The term “diagnostic confidence indication” as used herein generally refers to a representation, a number, a rank, a score, a degree or a value assigned to indicate the presence of one or more genetic variants and how much that presence is trusted. A diagnostic confidence indication may be indicative of a probability of identifying one or more genetic variations in a biological sample of the subject. For example, the representation can be a binary value or an alphanumeric ranking from A-Z, among others. In yet another example, the diagnostic confidence indication can have any value from 0 to 100, among others. In yet another example, the diagnostic confidence indication can be represented by a range or degree, e.g., “low” or “high”, “more” or “less”, “increased” or “decreased”. A low diagnostic confidence indication indicates that a detected genetic variant may be noise (e.g., that the detected presence of the genetic variant cannot be trusted too much). A high diagnostic confidence indication means that, for a detected genetic variant, the genetic variant is likely to exist. In some instances, a result may be untrusted if its diagnostic confidence indication is under 25-30 out of 100.

The diagnostic confidence indication for each variant can be adjusted to indicate a confidence of predicting a genetic variation. The confidence can be increased or decreased by using measurements at a plurality of time points or from a plurality of samples at the same time point or at different time points. The diagnostic confidence can be further adjusted based on the detection of co-variate variations. The diagnostic confidence indication can be assigned by any of a number of statistical methods and can be based, at least in part, on the frequency at which the measurements are observed over a period of time.

The term “co-variate variations” or “co-variate variants”, as used herein, generally refers to genetic variations that tend to vary together, for example, the presence of one variation is correlated with the presence of the co-variate variation. Accordingly, if a variant is seen below the diagnostic limit or the detection limit, and a co-variate variant is also detected, either above or below the detection limit, then it is more likely that the sample is positive for both variants, and they can be “called” as present in the sample. One example of co-variate variations is driver mutations and resistance mutations or mutations of unknown significance. That is, after a driver mutation is present, other mutations in the same gene, such as resistance mutations, may appear, especially after treatment and recurrence of a cancer. As a non-limiting example, a driver mutation may be detected above the detection limit with high diagnostic confidence. However, due to insufficient sampling or noise, it may be difficult to confidently assess whether another genetic variation is present. If the genetic variation is typically present with the driver mutation such that the variants are co-variate variants (such as a passenger mutation or a resistance mutation), the diagnostic confidence indication of the genetic variant will increase. The strength of association between certain variants detected together can increase the probability, likelihood, and/or confidence that genetic data detected below a detection limit is a genetic variation.

The term “DNA sequencing system”, as used herein, generally refers to DNA sample preparation protocols used in conjunction with a sequencing instrument. DNA sample preparation protocols may be directed to library preparation, amplification, adapter ligation, single strand elongation, among other molecular biological methods. A sequencing instrument may be any instrument capable of automating various sequencing methods or processes. Non-limiting examples of various sequencing methods or processes include: Sanger sequencing, high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, primer walking, and any other sequencing methods recognized in the art. A DNA sequencing system may comprise all protocols to prepare samples for sequencing in a particular sequencing instrument.

The term “subject,” as used herein, generally refers to any organism that is used in the methods of the disclosure. In some examples, a subject is a human, mammal, vertebrate, invertebrate, eukaryote, archaea, fungus, or prokaryote. In some instances, a subject can be a human. A subject can be living or dead. A subject can be a patient. For example, a subject may be suffering from a disease (or suspected of suffering from a disease) and/or in the care of a medical practitioner. A subject can be an individual that is undergoing treatment and/or diagnosis for a health or medical condition. A subject and/or family member can be related to another subject used in the methods of the disclosure (e.g., a sister, a brother, a mother, a father, a nephew, a niece, an aunt, an uncle, a grandparent, a great-grandparent, a cousin).

The term “nucleic acid,” as used herein, generally refers to a molecule comprising one or more nucleic acid subunits. A nucleic acid can include one or more subunits selected from adenosine (A), cytosine (C), guanine (G), thymine (T) and uracil (U), or variants thereof. A nucleotide can include A, C, G, T or U, or variants thereof. A nucleotide can include any subunit that can be incorporated into a growing nucleic acid strand. Such subunit can be an A, C, G, T, or U, or any other subunit that is specific to one or more complementary A, C, G, T or U, or complementary to a purine (i.e., A or G, or variant thereof) or a pyrimidine (i.e., C, T or U, or variant thereof). A subunit can enable individual nucleic acid bases or groups of bases (e.g., AA, TA, AT, GC, CG, CT, TC, GT, TG, AC, CA, or uracil-counterparts thereof) to be resolved. In some examples, a nucleic acid is deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), or derivatives thereof. A nucleic acid can be single-stranded or double stranded.

The term “genome” generally refers to an entirety of an organism's hereditary information. A genome can be encoded either in DNA or in RNA. A genome can comprise coding regions that code for proteins as well as non-coding regions. A genome can include the sequence of all chromosomes together in an organism. For example, the human genome has a total of 46 chromosomes. The sequence of all of these together constitutes the human genome.

The term “sample,” as used herein, generally refers to a biological sample. A sample may be or include blood, serum, plasma, vitreous, sputum, urine, tears, perspiration, saliva, semen, mucosal excretions, mucus, spinal fluid, amniotic fluid, lymph fluid and the like. A sample may be a cell-free sample. A sample may include nucleic acid molecules, such as polynucleotides. Polynucleotides may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA). Cell free polynucleotides may be fetal in origin (via fluid taken from a pregnant subject), or may be derived from tissue of the subject itself.

Detection Limit/Noise Range

Polynucleotide sequencing can be compared with a problem in communication theory. An initial individual polynucleotide or ensemble of polynucleotides can be conceptualized as an original message. Tagging and/or amplifying can be thought of as encoding the original message into a signal. Sequencing can be thought of as a communication channel. The output of a sequencer, e.g., sequence reads, can be thought of as a received signal. Bioinformatic processing can be thought of as a receiver that decodes the received signal to produce a transmitted message, e.g., a nucleotide sequence or sequences. The received signal can include artifacts, such as noise and distortion. Noise can be thought of as an unwanted random addition to a signal. Distortion can be thought of as an alteration in the amplitude of a signal or portion of a signal.

Noise can be introduced through errors in copying and/or reading a polynucleotide. For example, in a sequencing process, a single polynucleotide can first be subject to amplification. Amplification can introduce errors, so that a subset of the amplified polynucleotides may contain, at a particular locus, a base that is not the same as the original base at that locus. Furthermore, in the reading process a base at any particular locus may be read incorrectly. As a consequence, the collection of sequence reads can include a certain percentage of base calls at a locus that are not the same as the original base. In typical sequencing technologies this error rate can be in the single digits, e.g., 2%-3%. In some instances, the error rate can be up to about 10%, up to about 9%, up to about 8%, up to about 7%, up to about 6%, up to about 5%, up to about 4%, up to about 3%, up to about 2%, or up to about 1%. When a collection of molecules that are all presumed to have the same sequence are sequenced, this noise may be sufficiently small that one can identify the original base with high reliability.

However, if a collection of parent polynucleotides includes a subset of polynucleotides that vary at a particular locus, noise can be a significant problem. This can be the case, for example, when cell-free DNA includes not only germline DNA, but DNA from another source, such as fetal DNA or DNA from a cancer cell. In this case, if the frequency of molecules with sequence variants is in the same range as the frequency of errors introduced by the sequencing process, then true sequence variants may not be distinguishable from noise. This could interfere, for example, with detecting sequence variants in a sample. For example, sequences can have a per-base error rate of 0.5-1%. Amplification bias and sequencing errors introduce noise into the final sequencing product. This noise can diminish sensitivity of detection. As a non-limiting example, sequence variants whose frequency is less than the sequencing error rate can be mistaken for noise.
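One way to make the comparison between variant frequency and error rate concrete is a one-sided binomial test. The sketch below is not taken from the disclosure. It assumes, for illustration only, that errors at a locus are independent with a single known per-base rate, and the function names `binom_tail` and `distinguishable_from_noise` and the default significance level `alpha` are hypothetical.

```python
import math


def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space."""
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        # log of C(n, i) * p**i * (1 - p)**(n - i)
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def distinguishable_from_noise(variant_reads: int, total_reads: int,
                               error_rate: float, alpha: float = 0.01) -> bool:
    """True if this many variant reads is unlikely to arise from errors alone.

    Assumes independent errors at one fixed rate, which is a simplification.
    """
    return binom_tail(variant_reads, total_reads, error_rate) < alpha
```

Under these assumptions, with a 1% error rate and 1000 reads at a locus, 5 variant reads (0.5%) are not distinguishable from noise, whereas 50 variant reads (5%) are.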

A noise range or detection limit refers to instances where the frequency of molecules with sequence variants is in the same range as the frequency of errors introduced by the sequencing process. A “detection limit” may also refer to instances where too few variant-carrying molecules are sequenced for the variant to be detected. The frequency of molecules with sequence variants may be in the same range as the frequency of errors as a result of a small amount of nucleic acid molecules. As a non-limiting example, a sampled amount of nucleic acids, e.g. 100 ng, may contain a relatively small number of cell-free nucleic acid molecules, e.g. circulating tumor DNA molecules, such that the frequency of a sequence variant may be low, even though the variant may be present in a majority of circulating tumor DNA molecules. Alternately, the sequence variant may be rare or occur in only a very small amount of the sampled nucleic acids such that a detected variant is indistinguishable from noise and/or sequencing error. As a non-limiting example, at a particular locus, a genetic variant may only be detected in 0.1% to 5% of all reads at that locus.

Distortion can be manifested in the sequencing process as a difference in signal strength, e.g., total number of sequence reads, produced by molecules in a parent population at the same frequency. Distortion can be introduced, for example, through amplification bias, GC bias, or sequencing bias. This could interfere with detecting copy number variation in a sample. GC bias results in the uneven representation of areas rich or poor in GC content in the sequence reading. Also, by providing reads of sequences in greater or less amounts than their actual number in a population, amplification bias can distort measurements of copy number variation.

Sequencing and/or amplification artifacts or errors, such as noise and/or distortion, may be reduced in a polynucleotide sequencing process. Sequencing and/or amplification artifacts or errors may be reduced using a wide variety of techniques for sequencing and sequence analysis. Various techniques may include sequencing methodologies and/or statistical methods.

One way to reduce noise and/or distortion is to filter sequence reads. As a non-limiting example, sequence reads may be filtered by requiring sequence reads to meet a quality threshold, or by reducing GC bias. Such methods typically are performed on the collection of sequence reads that are the output of a sequencer, and can be performed sequence read-by-sequence read, without regard for family structure (sub-collections of sequences derived from a single original parent molecule).

Another way to reduce noise and/or distortion from a single individual molecule or from an ensemble of molecules is to group sequence reads into families derived from original individual molecules. Efficient conversion of individual polynucleotides in a sample of initial genetic material into sequence-ready tagged parent polynucleotides may increase the probability that individual polynucleotides in a sample of initial genetic material will be represented in a sequence-ready sample. This can produce sequence information about more polynucleotides in the initial sample. Additionally, high yield generation of consensus sequences for tagged parent polynucleotides by high-rate sampling of progeny polynucleotides amplified from the tagged parent polynucleotides, and collapsing of generated sequence reads into consensus sequences representing sequences of parent tagged polynucleotides, can reduce noise introduced by amplification bias and/or sequencing errors, and can increase sensitivity of detection. Collapsing sequence reads into a consensus sequence is one way to reduce noise in the received message from one molecule. Using probabilistic functions that convert received frequencies is another way to reduce noise and/or distortion. With respect to an ensemble of molecules, grouping reads into families and determining a quantitative measure of the families reduces distortion, for example, in the quantity of molecules at each of a plurality of different loci. Again, collapsing sequence reads of different families into consensus sequences eliminates errors introduced by amplification and/or sequencing error. Furthermore, determining frequencies of base calls based on probabilities derived from family information also reduces noise in the received message from an ensemble of molecules.
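The grouping and collapsing described above can be illustrated with a minimal sketch. It assumes, purely for the example, that each read arrives as a (tag, sequence) pair in which the tag identifies the original parent molecule, that reads in a family are already aligned to the same start position, and that a simple per-position majority vote stands in for whatever consensus rule is actually used. The name `collapse_families` and the `min_family_size` parameter are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def collapse_families(tagged_reads: Iterable[Tuple[str, str]],
                      min_family_size: int = 2) -> Dict[str, str]:
    """Group reads by molecular tag and return one consensus per family."""
    families = defaultdict(list)
    for tag, sequence in tagged_reads:
        families[tag].append(sequence)

    consensus = {}
    for tag, sequences in families.items():
        # Families that are too small cannot outvote a single error.
        if len(sequences) < min_family_size:
            continue
        # Compare only positions covered by every read in the family.
        length = min(len(s) for s in sequences)
        consensus[tag] = "".join(
            Counter(s[i] for s in sequences).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus
```

In this sketch, a base that appears in only one of three reads of a family is outvoted and does not reach the consensus, which is how amplification or sequencing errors confined to a minority of progeny are removed. Counting the resulting families per locus, rather than raw reads, gives the quantitative measure referred to in the text.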

Noise and/or distortion may be further reduced by comparing genetic variations in a sequence read with genetic variations in other sequence reads. A genetic variation observed in one sequence read and again in other sequence reads increases the probability that a detected variant is in fact a genetic variant and not merely a sequencing error or noise. As a non-limiting example, if a genetic variation is observed in a first sequence read and also observed in a second sequence read, a Bayesian inference may be made regarding whether the variation is in fact a genetic variation and not a sequencing error.
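The Bayesian inference mentioned above can be sketched with a simple odds update. The model is an assumption made for illustration: each observation of the variant is treated as independent, with a probability `p_obs_if_real` of being seen when the variant is truly present and a probability `p_obs_if_error` of being seen through error alone. These parameter names and the function name `posterior_real` are hypothetical.

```python
def posterior_real(prior: float, p_obs_if_real: float,
                   p_obs_if_error: float, n_observations: int) -> float:
    """Posterior probability that a variant is real after n independent sightings.

    Uses Bayes' rule in odds form: posterior odds = prior odds * LR ** n,
    where LR is the likelihood ratio of one sighting.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if p_obs_if_error <= 0.0:
        raise ValueError("p_obs_if_error must be positive")
    odds = prior / (1.0 - prior)
    odds *= (p_obs_if_real / p_obs_if_error) ** n_observations
    return odds / (1.0 + odds)
```

For instance, starting from an even prior of 0.5 with a likelihood ratio of 9 per observation, one observation gives a posterior of about 0.90 and a second independent observation raises it to about 0.99.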

The present disclosure provides methods for detecting variations in nucleic acid molecules, particularly those at a frequency within a noise range or below a detection limit. Variants initially detected in nucleic acid molecules can be compared to other variants, such as for example variants at the same locus or co-variate genetic variants, to determine whether a variant is more or less likely to be accurately detected. Variants may be detected in amplified nucleic acid molecules, detected in sequence reads or collapsed sequence reads.

Repeated detection of a variant may increase the probability, likelihood, and/or confidence that a variant is accurately detected. A variant can be repeatedly detected by comparing two or more sets of genetic data or genetic variations. The two or more sets of genetic variations can come both from samples taken at multiple time points and from different samples at the same time point (for example, a re-analyzed blood sample). In detecting a variant in the noise range or below the noise threshold, the re-sampling or repeated detection of a low frequency variant makes it more likely that the variant is in fact a variant and not a sequencing error. Re-sampling can be from the same sample, such as a sample that is re-analyzed or re-run, or from samples at different time points.

As a non-limiting example, a genetic variant having a low confidence score may be detected at a frequency or amount below the detection limit or noise range. However, if the genetic variant is observed again, such as for example at a later time point, in a prior sample, or upon re-analyzing a sample, the confidence score may increase. Thus, the variant may be detected with greater confidence despite being present in a frequency or amount below the detection limit or noise range. In other instances, where the genetic variant is not observed again upon, for example, re-sampling, a confidence score may remain constant or decrease. Alternately, if a genetic variant observed at a particular locus conflicts with a re-sampled result, the confidence score may decrease.

Co-variate detection may increase the probability, likelihood, and/or confidence that a variant is accurately detected. For co-variate genetic variants, the presence of one genetic variant is associated with the presence of one or more other genetic variants. Based on the detection of a co-variate genetic variation, it may be possible to infer the presence of an associated co-variate genetic variation, even where the associated genetic variation is present below a detection limit. Alternately, based on the detection of a co-variate genetic variation, the diagnostic confidence indication for the associated genetic variation may be increased. Further, in some instances where a co-variate variant is detected, a detection threshold for a co-variate variant detected below a detection limit may be decreased. Non-limiting examples of co-variate variations or genes include: driver mutations and resistance mutations, and driver mutations and passenger mutations. A specific example of co-variants or genes is the EGFR L858R activating mutation and the EGFR T790M resistance mutation, found in lung cancers. Numerous other co-variate variants and genes are associated with various resistance mutations and will be recognized by one having skill in the art.

The present disclosure provides methods for detecting genetic variants where at least some variants are in the noise range or threshold. In the noise threshold or range, it may be difficult or impossible to detect genetic variations with confidence. In some instances, a noise threshold provides a limit for detecting genetic variation with statistical confidence. The noise threshold or range may overlap with a sequencing error rate. The noise threshold may be the same as the sequencing error rate. The noise threshold may be lower than the sequencing error rate. The noise threshold may be up to about 10%, up to about 9%, up to about 8%, up to about 7%, up to about 6%, up to about 5%, up to about 4%, up to about 3%, up to about 2%, or up to about 1%. In some instances, the noise range is about 0.5% to 10% errors per base. In some instances, the noise threshold is about 0.5% to 5% errors per base. In some instances, the noise threshold is about 0.5% to 1% errors per base. The terms noise range and noise threshold may be used interchangeably.

Several types of genetic variants may be detected in nucleic acid molecules. Genetic variations may be interchangeably referred to as genetic variants or genetic aberrations. Genetic variations may include a single base substitution, a copy number variation, an indel and a gene fusion. A combination of these genetic variants may be detected. Non-limiting examples of additional genetic variants may also include: a transversion, a translocation, an inversion, a deletion, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, chromosome fusions, a gene truncation, a gene amplification, a gene duplication, a chromosomal lesion, a DNA lesion, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns and abnormal changes in nucleic acid methylation.

In one implementation, using measurements from a plurality of samples collected substantially at once or over a plurality of time points, the diagnostic confidence indication for each variant can be adjusted to indicate a confidence of predicting the observation of the copy number variation (CNV) or mutation. The confidence can be increased by using measurements at a plurality of time points to determine whether cancer is advancing, in remission or stabilized. The diagnostic confidence indication can be assigned by any of a number of statistical methods and can be based, at least in part, on the frequency at which the measurements are observed over a period of time. For example, a statistical correlation of current and prior results can be done. Alternatively, for each diagnosis, a hidden Markov model can be built, such that a maximum likelihood or maximum a posteriori decision can be made based on the frequency of occurrence of a particular test event from a plurality of measurements or time points. As part of this model, the probability of error and resultant diagnostic confidence indication for a particular decision can be output as well. In this manner, the measurements of a parameter, whether or not they are in the noise range, may be provided with a confidence interval. Tested over time, one can increase the predictive confidence of whether a cancer is advancing, stabilized or in remission by comparing confidence intervals over time. Two sampling time points can be separated by at least about 1 microsecond, 1 millisecond, 1 second, 10 seconds, 30 seconds, 1 minute, 10 minutes, 30 minutes, 1 hour, 12 hours, 1 day, 1 week, 2 weeks, 3 weeks, one month, or one year. Two time points can be separated by about a month to about a year, about a year to about 5 years, or no more than about three months, two months, one month, three weeks, two weeks, one week, one day, or twelve hours.

FIG. 1A shows a first exemplary system to reduce error rates and bias that can be orders of magnitude higher than what is required to reliably detect de novo genomic alterations associated with cancer. The process first captures genetic information by collecting body fluid samples as sources of genetic material (blood, saliva, sweat, among others) and then the process sequences the materials (1). For example, polynucleotides in a sample can be sequenced, producing a plurality of sequence reads. The tumor burden in a sample that comprises polynucleotides can be estimated as a ratio of the relative number of sequence reads bearing a variant, to the total number of sequence reads generated from the sample. Also, in the case of copy number variants, the tumor burden can be estimated as the relative excess (in the case of gene duplication) or relative deficit (in the case of gene elimination) of total number of sequence reads at test and control loci. So, for example, a run may produce 1000 reads mapping to an oncogene locus, of which 900 correspond to wild type and 100 correspond to a cancer mutant, indicating a tumor burden of 10%. More details on exemplary collection and sequencing of the genetic materials are discussed below in FIGS. 2-4.
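The two tumor burden estimates described above reduce to simple ratios, sketched here for illustration. The function names are hypothetical, and the copy number estimate assumes that reads at the test and control loci have already been normalized to be comparable.

```python
def tumor_burden_from_variant_reads(variant_reads: int, total_reads: int) -> float:
    """Fraction of reads at a locus that carry the variant."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return variant_reads / total_reads


def relative_copy_number_change(test_reads: float, control_reads: float) -> float:
    """Relative excess (positive) or deficit (negative) of reads at a test locus.

    Assumes test and control read counts are already normalized.
    """
    if control_reads <= 0:
        raise ValueError("control_reads must be positive")
    return (test_reads - control_reads) / control_reads
```

Applied to the example in the text, 100 mutant reads out of 1000 reads at the locus gives a tumor burden of 0.10, that is, 10%.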

Next, genetic information is processed (2). Genetic variants are then identified. The variants can be a single-nucleotide polymorphism (SNP), in case it is a common genetic variant, a mutation, in a case where it is a rare genetic variant, or a copy-number variation, for example. The process then determines the frequency of genetic variants in the sample containing the genetic material. Since this process is noisy, the process separates information from noise (3).

The sequencing methods have error rates. For example, the MiSeq system of Illumina can produce percent error rates in the low single digits. Thus, for 1000 sequence reads mapping to a locus, one might expect about 50 reads (about 5%) to include errors. Certain methodologies, such as those described in WO 2014/149134 (Talasaz and Eltoukhy), which is entirely incorporated herein by reference, can significantly reduce the error rate. Errors create noise that can obscure signals from cancer present at low levels in a sample. Thus, if a sample has a tumor burden at a level around the sequencing system error rate, e.g., around 0.1%-5%, it may be difficult to distinguish a signal corresponding to a genetic variant due to cancer from one due to noise.

Diagnosis of cancer can be done by analyzing the genetic variants, even in the presence of noise. The analysis can be based on the frequency of sequence variants or level of CNV (4), and a diagnosis confidence indication or level for detecting genetic variants in the noise range can be established (5).

Next, the process increases the diagnosis confidence. This can be done using a plurality of measurements to increase confidence of diagnosis (6), or alternatively using measurements at a plurality of time points to determine whether cancer is advancing, in remission or stabilized (7).

The diagnostic confidence can be used to identify disease states. For example, cell free polynucleotides taken from a subject can include polynucleotides derived from normal cells, as well as polynucleotides derived from diseased cells, such as cancer cells. Polynucleotides from cancer cells may bear genetic variants, such as somatic cell mutations and copy number variants. When cell free polynucleotides from a sample from a subject are sequenced, these cancer polynucleotides are detected as sequence variants or as copy number variants. The relative amount of tumor polynucleotides in a sample of cell free polynucleotides is referred to as the “tumor burden.”

Measurements of a parameter, whether or not they are in the noise range, may be provided with a confidence interval. Tested over time, one can determine whether a cancer is advancing, stabilized or in remission by comparing confidence intervals over time. Where the confidence intervals do not overlap, this indicates the direction of disease.
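The comparison of confidence intervals over time can be sketched as follows. The disclosure does not specify how the intervals are computed; the Wilson score interval for a binomial proportion is used here only as one reasonable assumption, and the names `wilson_interval` and `trend` and the default `z` of 1.96 (roughly 95% coverage) are hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(variant_reads: int, total_reads: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the variant fraction at one time point."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    p = variant_reads / total_reads
    denom = 1.0 + z * z / total_reads
    center = (p + z * z / (2.0 * total_reads)) / denom
    half = z * math.sqrt(
        p * (1.0 - p) / total_reads + z * z / (4.0 * total_reads ** 2)
    ) / denom
    return center - half, center + half


def trend(earlier: Tuple[float, float], later: Tuple[float, float]) -> str:
    """Report a direction only when the two intervals do not overlap."""
    if later[0] > earlier[1]:
        return "increasing"
    if later[1] < earlier[0]:
        return "decreasing"
    return "indeterminate"
```

Under this sketch, a variant seen in 5 of 1000 reads at a first time point and 50 of 1000 reads at a second yields non-overlapping intervals and is reported as increasing, while 5 versus 8 of 1000 reads yields overlapping intervals and no direction is reported.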

FIG. 1B shows a second exemplary system to reduce error rates and bias that can be orders of magnitude higher than what is required to reliably detect de novo genomic alterations associated with cancer. This is done by generating a sequence read by a genetic analyzer, e.g., a DNA sequencer, from a specimen (10). The system then characterizes the subject's genetic information over two or more samples or time points (12). Next, the system uses the information from the two or more sampling points or time points to produce an adjusted test result in characterizing the subject's genetic information (14).

The test result can be adjusted by enhancing or negating the confidence indication. For example, the process includes increasing a diagnostic confidence indication in a subsequent or a previous characterization if the information from a first time point corroborates information from the second time point. Alternatively, the process can increase a diagnostic confidence indication in the subsequent characterization if the information from a first time point corroborates information from the second time point. The diagnostic confidence indication in the subsequent characterization can be decreased if the information from a first time point conflicts with information from the second time point. Alternatively, the process can leave as is a diagnostic confidence indication in the subsequent characterization for de novo information.
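The adjustment rules just described can be sketched in code. This is a minimal illustration only: the function name `adjust_confidence`, the representation of prior results as a mapping from variant to a detected/not-detected flag, the fixed step of 0.1, and the clamping to the range 0 to 1 are all assumptions of the sketch, not features stated in the disclosure.

```python
from typing import Dict


def adjust_confidence(
    current: float,
    variant: str,
    prior_calls: Dict[str, bool],
    step: float = 0.1,  # hypothetical adjustment size
) -> float:
    """Adjust a diagnostic confidence indication using a prior time point.

    prior_calls maps a variant to True (detected earlier) or False
    (assessed earlier but not detected). A variant absent from
    prior_calls is treated as de novo information.
    """
    if variant not in prior_calls:
        return current                    # de novo: leave as is
    if prior_calls[variant]:
        return min(1.0, current + step)   # corroborated: increase
    return max(0.0, current - step)       # conflicting: decrease


# Hypothetical variant labels used purely for illustration
prior = {"EGFR_L858R": True, "TP53_R175H": False}
print(adjust_confidence(0.6, "EGFR_L858R", prior))
print(adjust_confidence(0.6, "TP53_R175H", prior))
print(adjust_confidence(0.6, "KRAS_G12D", prior))
```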

In one embodiment of FIG. 1B, the system compares current sequence reads by a genetic analyzer, e.g., a DNA sequencer, with prior sequence reads and updates a diagnostic confidence indication accordingly. Based on the enhanced confidence signal, the system accurately generates a genetic profile of extracellular polynucleotides in the subject, wherein the genetic profile comprises a plurality of data resulting from copy number variation and/or mutation analyses.

FIG. 1C shows a third exemplary system to reduce error rates and bias that can be orders of magnitude higher than what is required to reliably detect de novo genomic alterations associated with cancer. As a non-limiting example, the system performs cancer detection by sequencing of cell-free nucleic acid, wherein at least a portion of each gene in a panel of at least any of 10, 25, 50 or 100 genes is sequenced (20); comparing current sequence reads with prior sequence reads and updating a diagnostic confidence indication accordingly (22). The system then detects the presence or absence of genetic alteration and/or amount of genetic variation in an individual based on the diagnostic confidence indication of the current sequence read (24).

FIG. 1D shows yet another exemplary system to reduce error rates and bias that can be orders of magnitude higher than what is required to reliably detect de novo genomic alterations associated with cancer. The system performs cancer detection, for example, by sequencing of cell-free nucleic acid (30); comparing current sequence reads by the DNA sequencer with prior sequence reads and updating a diagnostic confidence accordingly, each consensus sequence corresponding to a unique polynucleotide among a set of tagged parent polynucleotides (32); and creating a genetic profile of extracellular polynucleotides in the subject, wherein the genetic profile comprises a plurality of data resulting from copy number variation or rare mutation analyses (34).

The systems of FIGS. 1A-1D detect with high sensitivity genetic variation in a sample of initial genetic material. The methods involve using one to three of the following tools: First, the efficient conversion of individual polynucleotides in a sample of initial genetic material into sequence-ready tagged parent polynucleotides, so as to increase the probability that individual polynucleotides in a sample of initial genetic material will be represented in a sequence-ready sample. This can produce sequence information about more polynucleotides in the initial sample. Second, high yield generation of consensus sequences for tagged parent polynucleotides by high rate sampling of progeny polynucleotides amplified from the tagged parent polynucleotides, and collapsing of generated sequence reads into consensus sequences representing sequences of parent tagged polynucleotides. This can reduce noise introduced by amplification bias and/or sequencing errors, and can increase sensitivity of detection. Third, the noise in the detection of mutations and copy number variations is reduced by comparing prior sample analysis with the current sample and increasing a diagnostic confidence indication if the same mutations and copy number variations have appeared in prior analysis and otherwise decreasing the diagnostic confidence indication if this is the first time the sequence is observed.

The system detects with high sensitivity genetic variation in a sample of initial genetic material. In one specific implementation, the system operation includes sample preparation, or the extraction and isolation of cell free polynucleotide sequences from a bodily fluid; subsequent sequencing of cell free polynucleotides by techniques utilized in the art; and application of bioinformatics tools to detect mutations and copy number variations as compared to a reference. The detection of mutations and copy number variations is enhanced by comparing prior sample analysis with the current sample and increasing a diagnostic confidence indication if the same mutations and copy number variations have appeared in prior analysis and otherwise decreasing or keeping unchanged the diagnostic confidence indication if this is the first time the sequence is observed. The systems and methods also may contain a database or collection of different mutations or copy number variation profiles of different diseases, to be used as additional references in aiding detection of mutations, copy number variation profiling or general genetic profiling of a disease.

After sequencing data of cell free polynucleotide sequences is collected, one or more bioinformatics processes may be applied to the sequence data to detect genetic features or variations such as copy number variation, mutations or changes in epigenetic markers, including but not limited to methylation profiles. In some cases, in which copy number variation analysis is desired, sequence data may be: 1) aligned with a reference genome; 2) filtered and mapped; 3) partitioned into windows or bins of a sequence; 4) coverage reads counted for each window; 5) coverage reads can then be normalized using a stochastic or statistical modeling algorithm; and 6) an output file can be generated reflecting discrete copy number states at various positions in the genome. In other cases, in which mutation analysis is desired, sequence data may be 1) aligned with a reference genome; 2) filtered and mapped; 3) frequency of variant bases calculated based on coverage reads for that specific base; 4) variant base frequency normalized using a stochastic, statistical or probabilistic modeling algorithm; and 5) an output file can be generated reflecting mutation states at various positions in the genome. Temporal information from the current and prior analysis of the patient or subject is used to enhance the analysis and determination.
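By way of illustration only, steps 3) to 5) of the copy number variation analysis (partitioning into windows, counting coverage reads per window, and normalizing) might be sketched as below. The function names, the fixed window size, and the use of simple median normalization in place of a stochastic or statistical modeling algorithm are assumptions of this sketch.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def window_coverage(read_starts: Iterable[int], window_size: int, n_windows: int) -> List[int]:
    """Count mapped reads per fixed-size window (bin) of a reference."""
    counts = Counter(pos // window_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_windows)]


def normalize(counts: List[int]) -> List[float]:
    """Scale window counts by their median (a simple stand-in for the
    statistical normalization described in the text)."""
    m = median(counts)
    if m == 0:
        raise ValueError("median coverage is zero; cannot normalize")
    return [c / m for c in counts]


# Toy mapped start positions on a 300-base reference, 100-base windows
starts = [5, 15, 25, 105, 115, 205, 215, 225, 235, 245, 255]
counts = window_coverage(starts, window_size=100, n_windows=3)
print(counts)
print(normalize(counts))
```

A normalized value well above 1 in a window would suggest a gain relative to the typical window, and a value well below 1 a loss; assigning discrete copy number states would require a further modeling step not shown here.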

A variety of different reactions and/or operations may occur within the systems and methods disclosed herein, including but not limited to: nucleic acid sequencing, nucleic acid quantification, sequencing optimization, detecting gene expression, quantifying gene expression, genomic profiling, cancer profiling, or analysis of expressed markers. Moreover, the systems and methods have numerous medical applications. For example, they may be used for the identification, detection, diagnosis, treatment, monitoring, staging of, or risk prediction of various genetic and non-genetic diseases and disorders including cancer. They may be used to assess subject response to different treatments of the genetic and non-genetic diseases, or provide information regarding disease progression and prognosis.

Polynucleotide Isolation and Extraction

The systems and methods of this disclosure may have a wide variety of uses in the manipulation, preparation, identification and/or quantification of nucleic acids including cell free polynucleotides. Examples of nucleic acids or polynucleotides include but are not limited to: DNA, RNA, amplicons, cDNA, dsDNA, ssDNA, plasmid DNA, cosmid DNA, high Molecular Weight (MW) DNA, chromosomal DNA, genomic DNA, viral DNA, bacterial DNA, mtDNA (mitochondrial DNA), mRNA, rRNA, tRNA, nRNA, siRNA, snRNA, snoRNA, scaRNA, microRNA, dsRNA, ribozyme, riboswitch and viral RNA (e.g., retroviral RNA).

Cell free polynucleotides may be derived from a variety of sources including human, mammal, non-human mammal, ape, monkey, chimpanzee, reptilian, amphibian, or avian sources. Further, samples may be extracted from a variety of animal fluids containing cell free sequences, including but not limited to blood, serum, plasma, vitreous, sputum, urine, tears, perspiration, saliva, semen, mucosal excretions, mucus, spinal fluid, amniotic fluid, lymph fluid and the like. Cell free polynucleotides may be fetal in origin (via fluid taken from a pregnant subject), or may be derived from tissue of the subject itself.

Isolation and extraction of cell free polynucleotides may be performed through collection of bodily fluids using a variety of techniques. In some cases, collection may comprise aspiration of a bodily fluid from a subject using a syringe. In other cases collection may comprise pipetting or direct collection of fluid into a collecting vessel.

After collection of bodily fluid, cell free polynucleotides may be isolated and extracted using a variety of techniques utilized in the art. In some cases, cell free DNA may be isolated, extracted and prepared using commercially available kits such as the Qiagen QIAamp® Circulating Nucleic Acid Kit protocol. In other examples, the Qubit™ dsDNA HS Assay kit protocol, Agilent™ DNA 1000 kit, or TruSeq™ Sequencing Library Preparation; Low-Throughput (LT) protocol may be used.

Generally, cell free polynucleotides are extracted and isolated from bodily fluids through a partitioning step in which cell free DNAs, as found in solution, are separated from cells and other non-soluble components of the bodily fluid. Partitioning may include, but is not limited to, techniques such as centrifugation or filtration. In other cases, cells are not partitioned from cell free DNA first, but rather lysed. In this example, the genomic DNA of intact cells is partitioned through selective precipitation. Cell free polynucleotides, including DNA, may remain soluble and may be separated from insoluble genomic DNA and extracted. Generally, after addition of buffers and other wash steps specific to different kits, DNA may be precipitated using isopropanol precipitation. Further clean up steps may be used, such as silica based columns to remove contaminants or salts. General steps may be optimized for specific applications. Non-specific bulk carrier polynucleotides, for example, may be added throughout the reaction to optimize certain aspects of the procedure such as yield.

Isolation and purification of cell free DNA may be accomplished using any methodology, including, but not limited to, the use of commercial kits and protocols provided by companies such as Sigma Aldrich, Life Technologies, Promega, Affymetrix, IBI or the like. Kits and protocols may also be non-commercially available.

After isolation, in some cases, the cell free polynucleotides are pre-mixed with one or more additional materials, such as one or more reagents (e.g., ligase, protease, polymerase) prior to sequencing.

One method of increasing conversion efficiency involves using a ligase engineered for optimal reactivity on single-stranded DNA, such as a ThermoPhage ssDNA ligase derivative. Such ligases bypass the traditional library preparation steps of end-repair and A-tailing, which can have poor efficiencies and/or accumulated losses due to intermediate cleanup steps, and allow for twice the probability that either the sense or anti-sense starting polynucleotide will be converted into an appropriately tagged polynucleotide. They also convert double-stranded polynucleotides that may possess overhangs that may not be sufficiently blunt-ended by the typical end-repair reaction. Optimal reaction conditions for this ssDNA reaction are: 1× reaction buffer (50 mM MOPS (pH 7.5), 1 mM DTT, 5 mM MgCl2, 10 mM KCl) with 50 mM ATP, 25 mg/ml BSA, 2.5 mM MnCl2, 200 pmol 85 nt ssDNA oligomer and 5 U ssDNA ligase, incubated at 65° C. for 1 hour. Subsequent amplification using PCR can further convert the tagged single-stranded library to a double-stranded library and yield an overall conversion efficiency of well above 20%. Other methods of increasing conversion rate, e.g., to above 10%, include, for example, any of the following, alone or in combination: annealing-optimized molecular-inversion probes, blunt-end ligation with a well-controlled polynucleotide size range, sticky-end ligation or an upfront multiplex amplification step with or without the use of fusion primers.

Molecular Barcoding of Cell Free Polynucleotides

The systems and methods of this disclosure may also enable the cell free polynucleotides to be tagged or tracked in order to permit subsequent identification and origin of the particular polynucleotide. This feature is in contrast with other methods that use pooled or multiplex reactions and that only provide measurements or analyses as an average of multiple samples. Here, the assignment of an identifier to individual or subgroups of polynucleotides may allow for a unique identity to be assigned to individual sequences or fragments of sequences. This may allow acquisition of data from individual samples and is not limited to averages of samples.

In some examples, nucleic acids or other molecules derived from a single strand may share a common tag or identifier and therefore may be later identified as being derived from that strand. Similarly, all of the fragments from a single strand of nucleic acid may be tagged with the same identifier or tag, thereby permitting subsequent identification of fragments from the parent strand. In other cases, gene expression products (e.g., mRNA) may be tagged in order to quantify expression, by which the barcode, or the barcode in combination with sequence to which it is attached, can be counted. In still other cases, the systems and methods can be used as a PCR amplification control. In such cases, multiple amplification products from a PCR reaction can be tagged with the same tag or identifier. If the products are later sequenced and demonstrate sequence differences, differences among products with the same identifier can then be attributed to PCR error.
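The use of a shared tag as a PCR amplification control can be illustrated with a short sketch that groups reads by tag and takes a per-position majority vote, so that minority differences within a tag family are treated as amplification or sequencing error. The function name `consensus_by_tag`, the (tag, sequence) input format, and the assumption that all reads in a family have equal length are choices made for this illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def consensus_by_tag(reads: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Collapse reads sharing a tag into one consensus sequence per tag.

    reads: (tag, sequence) pairs; sequences within a tag family are
    assumed to be the same length (an assumption of this sketch).
    """
    families = defaultdict(list)
    for tag, seq in reads:
        families[tag].append(seq)

    consensus = {}
    for tag, seqs in families.items():
        # Majority base at each position across the family
        consensus[tag] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


reads = [
    ("tagA", "ACGT"),
    ("tagA", "ACGT"),
    ("tagA", "ACTT"),  # differs at one position: attributed to PCR error
    ("tagB", "GGCC"),
]
print(consensus_by_tag(reads))
```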

Additionally, individual sequences may be identified based upon characteristics of sequence data for the reads themselves. For example, the detection of unique sequence data at the beginning (start) and end (stop) portions of individual sequencing reads may be used, alone or in combination with the length, or number of base pairs, of each sequence read, to assign unique identities to individual molecules. Fragments from a single strand of nucleic acid, having been assigned a unique identity, may thereby permit subsequent identification of fragments from the parent strand. This can be used in conjunction with bottlenecking the initial starting genetic material to limit diversity.

Further, unique sequence data at the beginning (start) and end (stop) portions of individual sequencing reads and sequencing read length may be used, alone or in combination, with the use of barcodes. In some cases, the barcodes may be unique as described herein. In other cases, the barcodes themselves may not be unique. In this case, the use of non-unique barcodes, in combination with sequence data at the beginning (start) and end (stop) portions of individual sequencing reads and sequencing read length, may allow for the assignment of a unique identity to individual sequences. Similarly, fragments from a single strand of nucleic acid, having been assigned a unique identity, may thereby permit subsequent identification of fragments from the parent strand.

Generally, the methods and systems provided herein are useful for preparation of cell free polynucleotide sequences for a downstream sequencing reaction. A sequencing method may be classic Sanger sequencing. Sequencing methods may include, but are not limited to: high-throughput sequencing, pyrosequencing, sequencing-by-synthesis, single-molecule sequencing, nanopore sequencing, semiconductor sequencing, sequencing-by-ligation, sequencing-by-hybridization, RNA-Seq (Illumina), Digital Gene Expression (Helicos), Next generation sequencing, Single Molecule Sequencing by Synthesis (SMSS) (Helicos), massively-parallel sequencing, Clonal Single Molecule Array (Solexa), shotgun sequencing, Maxam-Gilbert sequencing, primer walking, and any other sequencing methods recognized in the art.

Assignment of Barcodes to Cell Free Polynucleotide Sequences

The systems and methods disclosed herein may be used in applications that involve the assignment of unique or non-unique identifiers, or molecular barcodes, to cell free polynucleotides. The identifier may be a bar-code oligonucleotide that is used to tag the polynucleotide; but, in some cases, different unique identifiers are used. For example, in some cases, the unique identifier is a hybridization probe. In other cases, the unique identifier is a dye, in which case the attachment may comprise intercalation of the dye into the analyte molecule (such as intercalation into DNA or RNA) or binding to a probe labeled with the dye. In still other cases, the unique identifier may be a nucleic acid oligonucleotide, in which case the attachment to the polynucleotide sequences may comprise a ligation reaction between the oligonucleotide and the sequences or incorporation through PCR. In other cases, the reaction may comprise addition of a metal isotope, either directly to the analyte or by a probe labeled with the isotope. Generally, assignment of unique or non-unique identifiers, or molecular barcodes, in reactions of this disclosure may follow methods and systems described by, for example, U.S. Patent Publication Nos. 2001/0053519, 2003/0152490, 2011/0160078, and U.S. Pat. No. 6,582,908, each of which is entirely incorporated herein by reference.

The method may comprise attaching oligonucleotide barcodes to nucleic acid analytes through an enzymatic reaction including but not limited to a ligation reaction. For example, the ligase enzyme may covalently attach a DNA barcode to fragmented DNA (e.g., high molecular-weight DNA). Following the attachment of the barcodes, the molecules may be subjected to a sequencing reaction.

However, other reactions may be used as well. For example, oligonucleotide primers containing barcode sequences may be used in amplification reactions (e.g., PCR, qPCR, reverse-transcriptase PCR, digital PCR, etc.) of the DNA template analytes, thereby producing tagged analytes. After assignment of barcodes to individual cell free polynucleotide sequences, the pool of molecules may be sequenced.

In some cases, PCR may be used for global amplification of cell free polynucleotide sequences. This may comprise using adapter sequences that may be first ligated to different molecules followed by PCR amplification using universal primers. PCR for sequencing may be performed using any methodology, including but not limited to use of commercial kits provided by Nugen (WGA kit), Life Technologies, Affymetrix, Promega, Qiagen and the like. In other cases, only certain target molecules within a population of cell free polynucleotide molecules may be amplified. Specific primers may, in conjunction with adapter ligation, be used to selectively amplify certain targets for downstream sequencing.

The unique identifiers (e.g., oligonucleotide bar-codes, antibodies, probes, etc.) may be introduced to cell free polynucleotide sequences randomly or non-randomly. In some cases, they are introduced at an expected ratio of unique identifiers to microwells. For example, the unique identifiers may be loaded so that more than about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 unique identifiers are loaded per genome sample. In some cases, the unique identifiers may be loaded so that less than about 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 unique identifiers are loaded per genome sample. In some cases, the average number of unique identifiers loaded per sample genome is less than, or greater than, about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, 1000, 5000, 10,000, 50,000, 100,000, 500,000, 1,000,000, 10,000,000, 50,000,000 or 1,000,000,000 unique identifiers per genome sample.

In some cases, the unique identifiers may be a variety of lengths such that each barcode is at least about 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, or 1000 base pairs. In other cases, the barcodes may comprise less than 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 500, or 1000 base pairs.

In some cases, unique identifiers may be predetermined or random or semi-random sequence oligonucleotides. In other cases, a plurality of barcodes may be used such that barcodes are not necessarily unique to one another in the plurality. In this example, barcodes may be ligated to individual molecules such that the combination of the bar code and the sequence it may be ligated to creates a unique sequence that may be individually tracked. As described herein, detection of non-unique barcodes in combination with sequence data of beginning (start) and end (stop) portions of sequence reads may allow assignment of a unique identity to a particular molecule. The length, or number of base pairs, of an individual sequence read may also be used to assign a unique identity to such a molecule. As described herein, fragments from a single strand of nucleic acid, having been assigned a unique identity, may thereby permit subsequent identification of fragments from the parent strand. In this way the polynucleotides in the sample can be uniquely or substantially uniquely tagged.
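As a rough sketch of how a non-unique barcode can be combined with read start and stop positions to give a substantially unique identity, reads may be grouped on the composite key (barcode, start, end). The dictionary field names and the function `group_by_identity` are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_identity(reads: Iterable[dict]) -> Dict[Tuple[str, int, int], List[str]]:
    """Group reads by (barcode, start, end).

    Reads sharing all three values are treated as derived from the same
    parent molecule; a difference in any one of them yields a distinct
    identity, even if the barcode itself is not unique.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["barcode"], read["start"], read["end"])
        groups[key].append(read["name"])
    return dict(groups)


reads = [
    {"name": "r1", "barcode": "ACGT", "start": 1000, "end": 1160},
    {"name": "r2", "barcode": "ACGT", "start": 1000, "end": 1160},  # same parent as r1
    {"name": "r3", "barcode": "ACGT", "start": 1012, "end": 1175},  # same barcode, other fragment
    {"name": "r4", "barcode": "TTAG", "start": 1000, "end": 1160},  # same positions, other barcode
]
for key, names in group_by_identity(reads).items():
    print(key, names)
```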

The unique identifiers may be used to tag a wide range of analytes, including but not limited to RNA or DNA molecules. For example, unique identifiers (e.g., barcode oligonucleotides) may be attached to whole strands of nucleic acids or to fragments of nucleic acids (e.g., fragmented genomic DNA, fragmented RNA). The unique identifiers (e.g., oligonucleotides) may also bind to gene expression products, genomic DNA, mitochondrial DNA, RNA, mRNA, and the like.

In many applications, it may be important to determine whether individual cell free polynucleotide sequences each receive a different unique identifier (e.g., oligonucleotide barcode). If the population of unique identifiers introduced into the systems and methods is not sufficiently diverse, different analytes may possibly be tagged with identical identifiers. The systems and methods disclosed herein may enable detection of cell free polynucleotide sequences tagged with the same identifier. In some cases, a reference sequence may be included with the population of cell free polynucleotide sequences to be analyzed. The reference sequence may be, for example, a nucleic acid with a known sequence and a known quantity. If the unique identifiers are oligonucleotide barcodes and the analytes are nucleic acids, the tagged analytes may subsequently be sequenced and quantified. These methods may indicate if one or more fragments and/or analytes may have been assigned an identical barcode.

A method disclosed herein may comprise utilizing reagents necessary for the assignment of barcodes to the analytes. In the case of ligation reactions, reagents including, but not limited to, ligase enzyme, buffer, adapter oligonucleotides, a plurality of unique identifier DNA barcodes and the like may be loaded into the systems and methods. In the case of enrichment, reagents including but not limited to a plurality of PCR primers, oligonucleotides containing unique identifying sequence or barcode sequence, DNA polymerase, dNTPs, buffer and the like may be used in preparation for sequencing.

Generally, the method and system of this disclosure may utilize the methods of U.S. Pat. No. 7,537,897, which is entirely incorporated herein by reference, in using molecular barcodes to count molecules or analytes.

In a sample comprising fragmented genomic DNA, e.g., cell-free DNA (cfDNA), from a plurality of genomes, there is some likelihood that more than one polynucleotide from different genomes will have the same start and stop positions (“duplicates” or “cognates”). The probable number of duplicates beginning at any position is a function of the number of haploid genome equivalents in a sample and the distribution of fragment sizes. For example, cfDNA has a peak of fragments at about 160 nucleotides, and most of the fragments in this peak range from about 140 nucleotides to 180 nucleotides. Accordingly, cfDNA from a genome of about 3 billion bases (e.g., the human genome) may be comprised of almost 20 million (2×10⁷) polynucleotide fragments. A sample of about 30 ng DNA can contain about 10,000 haploid human genome equivalents. (Similarly, a sample of about 100 ng of DNA can contain about 30,000 haploid human genome equivalents.) A sample containing about 10,000 (10⁴) haploid genome equivalents of such DNA can have about 200 billion (2×10¹¹) individual polynucleotide molecules. It has been empirically determined that in a sample of about 10,000 haploid genome equivalents of human DNA, there are about 3 duplicate polynucleotides beginning at any given position. Thus, such a collection can contain a diversity of about 6×10¹⁰-8×10¹⁰ (about 60 billion-80 billion, e.g., about 70 billion (7×10¹⁰)) differently sequenced polynucleotide molecules.
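The arithmetic behind these estimates can be reproduced with a few lines of code. The constants are the approximate figures given in this disclosure (a genome of about 3 billion bases, a fragment peak of about 160 nucleotides, about 3 picograms of DNA per haploid human genome equivalent, and about 3 duplicates per position); the variable and function names are illustrative only.

```python
GENOME_BASES = 3_000_000_000   # approximate human genome size
PEAK_FRAGMENT_NT = 160         # approximate cfDNA fragment peak
PG_PER_HAPLOID_GENOME = 3      # approximate mass of one haploid genome equivalent
DUPLICATES_PER_POSITION = 3    # empirical figure for ~10,000 genome equivalents


def genome_equivalents(mass_ng: int) -> int:
    """Approximate haploid genome equivalents in a DNA mass given in ng."""
    return mass_ng * 1000 // PG_PER_HAPLOID_GENOME


fragments_per_genome = GENOME_BASES // PEAK_FRAGMENT_NT           # roughly 2 x 10^7
equivalents = genome_equivalents(30)                              # about 10,000
total_molecules = fragments_per_genome * equivalents              # roughly 2 x 10^11
distinct_molecules = total_molecules // DUPLICATES_PER_POSITION   # within 6-8 x 10^10

print(fragments_per_genome, equivalents, total_molecules, distinct_molecules)
```

Because the inputs are rounded, the computed values only approximate the rounded figures quoted in the text.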

The probability of correctly identifying molecules is dependent on the initial number of genome equivalents, the length distribution of sequenced molecules, sequence uniformity and the number of tags. A tag count equal to one is equivalent to having no unique tags, or not tagging. The table below lists the probability of correctly identifying a molecule as unique assuming a typical cell-free size distribution as above.

Tag Count     % Correctly uniquely identified

1000 human haploid genome equivalents
1             96.9643
4             99.2290
9             99.6539
16            99.8064
25            99.8741
100           99.9685

3000 human haploid genome equivalents
1             91.7233
4             97.8178
9             99.0198
16            99.4424
25            99.6412
100           99.9107

In this case, upon sequencing the genomic DNA, it may not be possible to determine which sequence reads are derived from which parent molecules. This problem can be diminished by tagging parent molecules with a sufficient number of unique identifiers (e.g., the tag count) such that there is a likelihood that two duplicate molecules, i.e., molecules having the same start and stop positions, bear different unique identifiers so that sequence reads are traceable back to particular parent molecules. One approach to this problem is to uniquely tag every, or nearly every, different parent molecule in the sample. However, depending on the number of haploid genome equivalents and distribution of fragment sizes in the sample, this may require billions of different unique identifiers.

The above method can be cumbersome and expensive. Individual polynucleotide fragments in a genomic nucleic acid sample (e.g., genomic DNA sample) can be uniquely identified by tagging with non-unique identifiers, e.g., non-uniquely tagging the individual polynucleotide fragments. As used herein, a collection of molecules can be considered to be “uniquely tagged” if each of at least 95% of the molecules in the collection bears an identifying tag (“identifier”) that is not shared by any other molecule in the collection (“unique tag” or “unique identifier”). For non-unique tags, the number of tags may be fewer than the number of unique molecules in the sample. For non-unique tags, the number of tags may be fewer than 10% of the number of molecules in the sample. For non-unique tags, the number of tags may be fewer than 1% of the number of molecules in the sample. A collection of molecules can be considered to be “non-uniquely tagged” if each of at least 1%, at least 5%, at least 10%, at least 15%, at least 20%, at least 25%, at least 30%, at least 35%, at least 40%, at least 45%, or at least or about 50% of the molecules in the collection bears an identifying tag that is shared by at least one other molecule in the collection (“non-unique tag” or “non-unique identifier”). In some embodiments, for a non-uniquely tagged population, no more than 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, or 50% of the molecules are uniquely tagged. In some embodiments, for unique tagging, at least two times as many different tags are used as the estimated number of molecules in the sample. The number of different identifying tags used to tag molecules in a collection can range, for example, between any of 2, 4, 8, 16, or 32 at the low end of the range, and any of 50, 100, 500, 1000, 5000 and 10,000 at the high end of the range. So, for example, a collection of between 100 billion and 1 trillion molecules can be tagged with between 4 and 100 different identifying tags.

The present disclosure provides methods and compositions in which a population of polynucleotides in a sample of fragmented genomic DNA is tagged with n different unique identifiers. In some embodiments, n is at least 2 and no more than 100,000*z, wherein z is a measure of central tendency (e.g., mean, median, mode) of an expected number of duplicate molecules having the same start and stop positions. In some embodiments, z is 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10. In some embodiments, z is less than 10, less than 9, less than 8, less than 7, less than 6, less than 5, less than 4, less than 3. In certain embodiments, n is at least any of 2*z, 3*z, 4*z, 5*z, 6*z, 7*z, 8*z, 9*z, 10*z, 11*z, 12*z, 13*z, 14*z, 15*z, 16*z, 17*z, 18*z, 19*z, or 20*z (e.g., lower limit). In other embodiments, n is no greater than 100,000*z, 10,000*z, 1000*z or 100*z (e.g., upper limit). Thus, n can range between any combination of these lower and upper limits. In certain embodiments, n is between 5*z and 15*z, between 8*z and 12*z, or about 10*z. For example, a haploid human genome equivalent has about 3 picograms of DNA. A sample of about 1 microgram of DNA contains about 300,000 haploid human genome equivalents. In some embodiments, the number n can be between 5 and 95, 6 and 80, 8 and 75, 10 and 70, 15 and 45, between 24 and 36 or about 30. In some embodiments, the number n is less than 96. For example, the number n can be greater than or equal to 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, or 95. In some situations, the number n can be greater than zero but less than 100, 99, 98, 97, 96, 95, 94, 93, 92, 91, or 90. In some examples, the number n is 64. The number n can be less than 75, less than 50, less than 40, less than 30, less than 20, less than 10, or less than 5. Improvements in sequencing can be achieved as long as at least some of the duplicate or cognate polynucleotides bear unique identifiers, that is, bear different tags. However, in certain embodiments, the number of tags used is selected so that there is at least a 95% chance that all duplicate molecules comprising the same start and end sequences bear unique identifiers.
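One way to reason about selecting the number of tags for a 95% chance is sketched below. It assumes, purely for illustration, that each of z duplicate molecules receives one of n identifiers independently and uniformly at random, so that the chance that all z bear different identifiers is the product of (n - i)/n for i from 0 to z - 1. The function names are hypothetical, and this simplified model is not asserted to be the calculation underlying the figures in this disclosure.

```python
def prob_all_distinct(n_tags: int, z: int) -> float:
    """Probability that z duplicates all receive different tags, assuming
    independent, uniformly random assignment from n_tags identifiers."""
    p = 1.0
    for i in range(z):
        p *= (n_tags - i) / n_tags
    return p


def min_tags(z: int, target: float = 0.95) -> int:
    """Smallest tag count giving at least the target probability."""
    n = max(z, 1)
    while prob_all_distinct(n, z) < target:
        n += 1
    return n


# With about 3 duplicates per position, compare a few tag counts
for n in (36, 64, 100):
    print(n, round(prob_all_distinct(n, 3), 4))
print(min_tags(3))
```

Under these assumptions, 64 identifiers and 3 duplicates give a probability just above 95%, while 36 identifiers give somewhat less.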

Some embodiments provide methods for performing a ligation reaction in which parent polynucleotides in a sample are admixed with a reaction mixture comprising y different barcode oligonucleotides, wherein y is the square root of n. The ligation can result in the random attachment of barcode oligonucleotides to parent polynucleotides in the sample. The reaction mixture can then be incubated under ligation conditions sufficient to effect ligation of barcode oligonucleotides to parent polynucleotides of the sample. In some embodiments, random barcodes selected from the y different barcode oligonucleotides are ligated to both ends of parent polynucleotides. Random ligation of the y barcodes to one or both ends of the parent polynucleotides can result in production of y² unique identifiers. For example, a sample comprising about 10,000 haploid human genome equivalents of cfDNA can be tagged with about 36 unique identifiers. The unique identifiers can comprise six unique DNA barcodes. Ligation of 6 unique barcodes to both ends of a polynucleotide can result in 36 possible unique identifiers.
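The relationship between y barcodes and y² identifiers can be illustrated by enumerating ordered pairs of barcodes, one for each end of a polynucleotide. The barcode sequences shown are placeholders invented for this sketch, and the function name is hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def end_pair_identifiers(barcodes: Sequence[str]) -> List[Tuple[str, str]]:
    """Enumerate all (first-end, second-end) barcode combinations.

    With y barcodes ligated independently to both ends, y * y ordered
    pairs are possible.
    """
    return list(product(barcodes, repeat=2))


six = ["AAC", "ACG", "AGT", "CAT", "CGA", "GTC"]  # placeholder sequences
eight = six + ["TAG", "TCA"]                      # placeholder sequences

print(len(end_pair_identifiers(six)))    # 6 barcodes
print(len(end_pair_identifiers(eight)))  # 8 barcodes
```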

In some embodiments, a sample comprising about 10,000 haploid human genome equivalents of DNA is tagged with 64 unique identifiers, wherein the 64 unique identifiers are produced by ligation of 8 unique barcodes to both ends of parent polynucleotides. The ligation efficiency of the reaction can be over 10%, over 20%, over 30%, over 40%, over 50%, over 60%, over 70%, over 80%, or over 90%. The ligation conditions can comprise use of bi-directional adaptors that can bind either end of the fragment and still be amplifiable. The ligation conditions can comprise blunt end ligation, as opposed to tailing with forked adaptors. The ligation conditions can comprise careful titration of an amount of adaptor and/or barcode oligonucleotides. The ligation conditions can comprise the use of over 2×, over 5×, over 10×, over 20×, over 40×, over 60×, or over 80× (e.g., ~100×) molar excess of adaptor and/or barcode oligonucleotides as compared to an amount of parent polynucleotide fragments in the reaction mixture. The ligation conditions can comprise use of a T4 DNA ligase (e.g., NEBNext Ultra Ligation Module). In an example, 18 microliters of ligase master mix is used with a 90 microliter ligation (18 parts of the 90) and ligation enhancer. Accordingly, tagging parent polynucleotides with n unique identifiers can comprise use of a number y of different barcodes, wherein y is the square root of n. Samples tagged in such a way can be those with a range of about 10 ng to any of about 100 ng, about 1 μg, or about 10 μg of fragmented polynucleotides, e.g., genomic DNA, e.g., cfDNA. The number y of barcodes used to identify parent polynucleotides in a sample can depend on the amount of nucleic acid in the sample.

The present disclosure also provides compositions of tagged polynucleotides. The polynucleotides can comprise fragmented DNA, e.g., cfDNA. A set of polynucleotides in the composition that map to a mappable base position in a genome can be non-uniquely tagged, that is, the number of different identifiers can be at least 2 and fewer than the number of polynucleotides that map to the mappable base position. A composition of between about 10 ng to about 10 μg (e.g., any of about 10 ng-1 μg, about 10 ng-100 ng, about 100 ng-10 μg, about 100 ng-1 μg, about 1 μg-10 μg) can bear between any of 2, 5, 10, 50 or 100 to any of 100, 1000, 10,000 or 100,000 different identifiers. For example, between 5 and 100 different identifiers can be used to tag the polynucleotides in such a composition.

FIG. 2 shows an exemplary process for analyzing polynucleotides in a sample of initial genetic material. First, a sample containing initial genetic material is provided and cell free DNA can be extracted (50). The sample can include target nucleic acid in low abundance. For example, nucleic acid from a normal or wild-type genome (e.g., a germline genome) can predominate in a sample that also includes no more than 20%, no more than 10%, no more than 5%, no more than 1%, no more than 0.5% or no more than 0.1% nucleic acid from at least one other genome containing genetic variation, e.g., a cancer genome or a fetal genome, or a genome from another individual or species. The sample can include, for example, cell free nucleic acid or cells comprising nucleic acid with proper oversampling of the original polynucleotides by the sequencing or genetic analysis process.

Next, the initial genetic material is converted into a set of tagged parent polynucleotides and sequenced to produce sequence reads (52). This step generates a plurality of genomic fragment sequence reads. In some cases, these sequence reads may contain barcode information. In other examples, barcodes are not utilized. Tagging can include attaching sequenced tags to molecules in the initial genetic material. Sequenced tags can be selected so that all unique polynucleotides mapping to the same reference sequence have a unique identifying tag. Conversion can be performed at high efficiency, for example at least 50%. The set of tagged parent polynucleotides can be amplified to produce a set of amplified progeny polynucleotides. Amplification may be, for example, 1,000-fold. The set of amplified progeny polynucleotides is sampled for sequencing at a sampling rate so that the sequence reads produced both (1) cover a target number of unique molecules in the set of tagged parent polynucleotides and (2) cover unique molecules in the set of tagged parent polynucleotides at a target coverage fold (e.g., 5- to 10-fold coverage of parent polynucleotides). The set of sequence reads is collapsed to produce a set of consensus sequences corresponding to unique tagged parent polynucleotides. Sequence reads can be qualified for inclusion in the analysis. For example, sequence reads that fail to meet a quality control score can be removed from the pool. Sequence reads can be sorted into families representing reads of progeny molecules derived from a particular unique parent molecule. For example, a family of amplified progeny polynucleotides can constitute those amplified molecules derived from a single parent polynucleotide. By comparing sequences of progeny in a family, a consensus sequence of the original parent polynucleotide can be deduced. This produces a set of consensus sequences representing unique parent polynucleotides in the tagged pool.
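The collapsing of sequence reads into families and consensus sequences can be sketched as follows. This is an illustration under stated assumptions, not the method of the disclosure: a family is assumed to be identified by the tuple (tag, start, end), and the consensus is taken by simple per-position majority vote, which is only one of many possible consensus rules. The function name `collapse_reads` and the read tuple layout are hypothetical.

```python
from collections import Counter, defaultdict


def collapse_reads(reads):
    """Group reads into families and return one consensus sequence per family.

    reads: iterable of (tag, start, end, sequence) tuples.
    Returns a dict mapping (tag, start, end) to the consensus sequence.
    """
    families = defaultdict(list)
    for tag, start, end, sequence in reads:
        families[(tag, start, end)].append(sequence)

    consensus = {}
    for key, sequences in families.items():
        # Compare only the positions that every read in the family covers.
        length = min(len(s) for s in sequences)
        consensus[key] = "".join(
            Counter(s[i] for s in sequences).most_common(1)[0][0]
            for i in range(length)
        )
    return consensus


example_reads = [
    ("T1", 100, 250, "ACGT"),
    ("T1", 100, 250, "ACGT"),
    ("T1", 100, 250, "ACTT"),  # one progeny read carries an amplification error
    ("T2", 100, 250, "ACGA"),  # same coordinates, different tag: separate parent
]
print(collapse_reads(example_reads))
```

In this example the single discordant base in the first family is outvoted, while the second family is kept as a distinct parent molecule because its tag differs.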

Next, the process assigns a confidence score for the sequence (54). After sequencing, reads are assigned a quality score. A quality score may be a representation of reads that indicates whether those reads may be useful in subsequent analysis based on a threshold. In some cases, some reads are not of sufficient quality or length to perform the subsequent mapping step. Sequencing reads with a predetermined quality score (above 90% for example) may be filtered out of the data. The genomic fragment reads that meet a specified quality score threshold are mapped to a reference genome, or a template sequence that is known not to contain copy number variations. After mapping alignment, sequence reads are assigned a mapping score. A mapping score may be a representation of reads mapped back to the reference sequence indicating whether each position is or is not uniquely mappable. In some instances, reads may be sequences unrelated to copy number variation analysis. For example, some sequence reads may originate from contaminant polynucleotides. Sequencing reads with a mapping score of at least 90%, 95%, 99%, 99.9%, 99.99% or 99.999% may be filtered out of the data set. In other cases, sequencing reads assigned a mapping score less than a predetermined percentage may be filtered out of the data set.

The genomic fragment reads that meet a specified quality score threshold are mapped to a reference genome, or a template sequence that is known not to contain copy number variations. After mapping alignment, sequence reads are assigned a mapping score. In some instances, reads may be sequences unrelated to copy number variation analysis. After data filtering and mapping, the plurality of sequence reads generates a chromosomal region of coverage. These chromosomal regions may be divided into variable length windows or bins. A window or bin may be at least 5 kb, 10 kb, 25 kb, 30 kb, 35 kb, 40 kb, 50 kb, 60 kb, 75 kb, 100 kb, 150 kb, 200 kb, 500 kb, or 1000 kb. A window or bin may also have bases up to 5 kb, 10 kb, 25 kb, 30 kb, 35 kb, 40 kb, 50 kb, 60 kb, 75 kb, 100 kb, 150 kb, 200 kb, 500 kb, or 1000 kb. A window or bin may also be about 5 kb, 10 kb, 25 kb, 30 kb, 35 kb, 40 kb, 50 kb, 60 kb, 75 kb, 100 kb, 150 kb, 200 kb, 500 kb, or 1000 kb.

For coverage normalization, each window or bin is selected to contain about the same number of mappable bases. In some cases, each window or bin in a chromosomal region may contain exactly the same number of mappable bases. In other cases, each window or bin may contain a different number of mappable bases. Additionally, each window or bin may be non-overlapping with an adjacent window or bin. In other cases, a window or bin may overlap with another adjacent window or bin. In some cases a window or bin may overlap by at least 1 bp, 2 bp, 3 bp, 4 bp, 5 bp, 10 bp, 20 bp, 25 bp, 50 bp, 100 bp, 200 bp, 250 bp, 500 bp, or 1000 bp.

In some cases, each of the window regions may be sized so that they contain about the same number of uniquely mappable bases. The mappability of each base that comprises a window region is determined and used to generate a mappability file which contains a representation of reads from the references that are mapped back to the reference for each file. The mappability file contains one row for every position, indicating whether each position is or is not uniquely mappable.
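Sizing windows by uniquely mappable bases can be sketched as a single pass over the per-position mappability flags. The sketch assumes the mappability file has already been loaded into a list of 0/1 values, one per position; the function name `windows_by_mappable_bases` and the half-open (start, end) convention are choices made here for illustration.

```python
def windows_by_mappable_bases(mappable, bases_per_window):
    """Split positions into non-overlapping windows that each contain
    `bases_per_window` uniquely mappable bases.

    mappable: sequence of 0/1 flags, one per position.
    Returns a list of half-open (start, end) intervals. A final window with
    fewer mappable bases is kept; trailing unmappable positions are dropped.
    """
    windows = []
    start = 0
    count = 0
    for position, flag in enumerate(mappable):
        if flag:
            count += 1
        if count == bases_per_window:
            windows.append((start, position + 1))
            start = position + 1
            count = 0
    if count:
        windows.append((start, len(mappable)))
    return windows


flags = [1, 0, 1, 1, 0, 0, 1, 1, 1, 0]
print(windows_by_mappable_bases(flags, 2))  # [(0, 3), (3, 7), (7, 9)]
```

The windows differ in physical length (3, 4 and 2 positions) but each contains the same number of mappable bases, which is what makes their read counts comparable.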

Additionally, predefined windows, known throughout the genome to be hard to sequence, or contain a substantially high GC bias, may be filtered from the data set. For example, regions known to fall near the centromere of chromosomes (i.e., centromeric DNA) are known to contain highly repetitive sequences that may produce false positive results. These regions may be filtered out. Other regions of the genome, such as regions that contain an unusually high concentration of other highly repetitive sequences such as microsatellite DNA, may be filtered from the data set.

The number of windows analyzed may also vary. In some cases, at least 10, 20, 30, 40, 50, 100, 200, 500, 1000, 2000, 5,000, 10,000, 20,000, 50,000 or 100,000 windows are analyzed. In other cases, up to 10, 20, 30, 40, 50, 100, 200, 500, 1000, 2000, 5,000, 10,000, 20,000, 50,000 or 100,000 windows are analyzed.

For an exemplary genome derived from cell free polynucleotide sequences, the next step comprises determining read coverage for each window region. This may be performed using either reads with barcodes, or without barcodes. In cases without barcodes, the previous mapping steps will provide coverage of different base positions. Sequence reads that have sufficient mapping and quality scores and fall within chromosome windows that are not filtered may be counted. The number of coverage reads may be assigned a score for each mappable position. In cases involving barcodes, all sequences with the same barcode, physical properties or combination of the two may be collapsed into one read, as they are all derived from the same parent molecule. This step reduces biases which may have been introduced during any of the preceding steps, such as steps involving amplification. For example, if one molecule is amplified 10 times but another is amplified 1000 times, each molecule is only represented once after collapse, thereby negating the effect of uneven amplification. Only reads with unique barcodes may be counted for each mappable position and influence the assigned score. For this reason, it is important that the barcode ligation step be performed in a manner optimized for producing the lowest amount of bias. The sequence for each base is aligned as the most dominant nucleotide read for that specific location. Further, the number of unique molecules can be counted at each position to derive simultaneous quantification at each position.
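The effect of collapsing on uneven amplification can be illustrated with a small count. The sketch assumes that a parent molecule is identified by its barcode together with its start and end coordinates, as one example of "barcode, physical properties or combination of the two"; the record layout and the function name `unique_molecule_counts` are hypothetical.

```python
from collections import Counter


def unique_molecule_counts(reads):
    """Count distinct parent molecules per window.

    reads: iterable of (window, barcode, start, end) tuples, one per read.
    Reads sharing barcode, start and end within a window count once.
    """
    seen = set()
    counts = Counter()
    for window, barcode, start, end in reads:
        key = (window, barcode, start, end)
        if key not in seen:
            seen.add(key)
            counts[window] += 1
    return counts


# One molecule amplified 10 times and another amplified 1000 times.
amplified = [(0, "BC1", 100, 260)] * 10 + [(0, "BC2", 100, 260)] * 1000
print(len(amplified))                        # 1010 raw reads
print(unique_molecule_counts(amplified)[0])  # 2 unique molecules
```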

The discrete copy number states of each window region can be utilized to identify copy number variation in the chromosomal regions. In some cases, all adjacent window regions with the same copy number can be merged into a segment to report the presence or absence of copy number variation state. In some cases, various windows can be filtered before they are merged with other segments.
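Merging adjacent windows that share a copy number state can be sketched with a run-length grouping. The representation of a segment as (first window index, one past last window index, state) is an assumption made here for illustration.

```python
from itertools import groupby


def merge_segments(states):
    """Merge runs of adjacent windows with the same copy number state.

    states: list of per-window copy number states, in genomic order.
    Returns a list of (start_index, end_index_exclusive, state) tuples.
    """
    segments = []
    index = 0
    for state, run in groupby(states):
        run_length = len(list(run))
        segments.append((index, index + run_length, state))
        index += run_length
    return segments


print(merge_segments([2, 2, 3, 3, 3, 2]))  # [(0, 2, 2), (2, 5, 3), (5, 6, 2)]
```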

In determining the nucleic acid read coverage for each window, the coverage of each window can be normalized by the mean coverage of that sample. Using such an approach, it may be desirable to sequence both the test subject and the control under similar conditions. The read coverage for each window may be then expressed as a ratio across similar windows.

Nucleic acid read coverage ratios for each window of the test subject can be determined by dividing the read coverage of each window region of the test sample by the read coverage of a corresponding window region of the control sample.
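The two normalization steps, division by the sample mean and division by the corresponding control window, can be sketched together. The coverage values below are made up for illustration, and returning NaN for a control window with no coverage is a choice made here, not something the disclosure specifies.

```python
import math


def normalize_by_mean(coverage):
    """Divide each window's coverage by the mean coverage of the sample."""
    mean = sum(coverage) / len(coverage)
    if mean == 0:
        raise ValueError("sample has no coverage")
    return [c / mean for c in coverage]


def coverage_ratios(test, control):
    """Per-window ratio of normalized test coverage to normalized control coverage."""
    if len(test) != len(control):
        raise ValueError("test and control must cover the same windows")
    test_norm = normalize_by_mean(test)
    control_norm = normalize_by_mean(control)
    return [t / c if c > 0 else math.nan for t, c in zip(test_norm, control_norm)]


# Hypothetical unique-molecule counts for four windows.
ratios = coverage_ratios([100, 100, 200, 100], [50, 50, 50, 50])
print([round(r, 3) for r in ratios])  # [0.8, 0.8, 1.6, 0.8]
```

Because each sample is first divided by its own mean, the ratio is insensitive to the test and control having been sequenced to different total depths.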

Next, the process looks up prior confidence scores for each read family for the patient (58). This information is stored in a database. Prior analysis of the patient's test result can be used to refine the confidence score, as detailed in FIG. 2. The information is used to infer the frequency of each sequence read at a locus in the set of tagged parent polynucleotides based on confidence scores among sequence read families (60). The historical database is then updated with the current confidence score for future use (62). In this manner, consensus sequences can be generated from families of sequence reads to improve noise elimination.

Turning now to FIG. 3, the process receives genetic materials from a blood sample or other body samples (102). The process converts the polynucleotides from the genetic materials into tagged parent nucleotides (104). The tagged parent nucleotides are amplified to produce amplified progeny polynucleotides (106). A subset of the amplified polynucleotides is sequenced to produce sequence reads (108), which are grouped into families, each generated from a unique tagged parent nucleotide (110). At a selected locus, the process assigns each family a confidence score (112). Next, a consensus is determined using prior readings. This is done by reviewing the prior confidence scores for each family, and if consistent prior confidence scores exist, then the current confidence score is increased (114). If there are prior confidence scores, but they are inconsistent, the current confidence score is not modified in one embodiment (116). In other embodiments, the confidence score is adjusted in a predetermined manner for inconsistent prior confidence scores. If this is the first time the family is detected, the current confidence score can be reduced as it may be a false reading (118). The process can infer the frequency of the family at the locus in the set of tagged parent polynucleotides based on the confidence score (120).
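The three-way adjustment of steps 114, 116 and 118 can be sketched as a small function. The disclosure does not define what makes prior scores "consistent" or by how much a score is raised or lowered, so everything numeric here is an assumption: priors are treated as consistent when they lie within `tolerance` of one another, and `boost` and `penalty` are arbitrary multipliers. The names `adjust_confidence` and `record_score` are hypothetical.

```python
def adjust_confidence(current, priors, boost=1.2, penalty=0.8, tolerance=0.1):
    """Adjust a family's current confidence score using its prior scores.

    priors: confidence scores for the same family from earlier time points.
    """
    if not priors:
        # First detection: may be a false reading, so reduce (step 118).
        return current * penalty
    if max(priors) - min(priors) <= tolerance:
        # Consistent history: increase, capped at 1.0 (step 114).
        return min(1.0, current * boost)
    # Inconsistent history: leave unchanged in this embodiment (step 116).
    return current


def record_score(history, family, score):
    """Append the current score to the family's history for future use."""
    history.setdefault(family, []).append(score)


history = {}
family = ("T1", 100, 250)
for observed in (0.5, 0.5, 0.5):
    adjusted = adjust_confidence(observed, history.get(family, []))
    record_score(history, family, adjusted)
print([round(s, 3) for s in history[family]])
```

In this run the first observation is discounted because the family has no history, and later observations are raised once a consistent history has accumulated.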

While temporal information has been used in FIGS. 1-2 to enhance the information for mutation or copy number variation detection, other consensus methods can be applied. In other embodiments, the historical comparison can be used in conjunction with other consensus sequences mapping to a particular reference sequence to detect instances of genetic variation. Consensus sequences mapping to particular reference sequences can be measured and normalized against control samples. Measures of molecules mapping to reference sequences can be compared across a genome to identify areas in the genome in which copy number varies, or heterozygosity is lost. Consensus methods include, for example, linear or non-linear methods of building consensus sequences (such as voting, averaging, statistical, maximum a posteriori or maximum likelihood detection, dynamic programming, Bayesian, hidden Markov or support vector machine methods, etc.) derived from digital communication theory, information theory, or bioinformatics. After the sequence read coverage has been determined, a stochastic modeling algorithm is applied to convert the normalized nucleic acid sequence read coverage for each window region to the discrete copy number states. In some cases, this algorithm may comprise one or more of the following: Hidden Markov Model, dynamic programming, support vector machine, Bayesian network, trellis decoding, Viterbi decoding, expectation maximization, Kalman filtering methodologies and neural networks.
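As one concrete instance of the listed algorithms, the sketch below applies Viterbi decoding over a small hidden Markov model to turn normalized coverage ratios into discrete copy number states. All model parameters are assumptions chosen for illustration: the candidate states, Gaussian emissions centered at state/2 (that is, relative to a diploid control) with a common standard deviation `sigma`, a uniform initial distribution, and a single `stay_prob` for remaining in the same state. At least two candidate states are required.

```python
import math


def viterbi_copy_number(ratios, states=(1, 2, 3), sigma=0.15, stay_prob=0.9):
    """Return the most likely copy number state for each window."""
    if not ratios:
        return []
    n_states = len(states)
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (n_states - 1))

    def log_emit(ratio, state):
        # Gaussian log-likelihood up to a constant shared by all states.
        expected = state / 2.0
        return -((ratio - expected) ** 2) / (2.0 * sigma ** 2)

    def log_trans(i, j):
        return log_stay if i == j else log_switch

    # Uniform prior over states for the first window.
    score = [log_emit(ratios[0], s) for s in states]
    backpointers = []
    for ratio in ratios[1:]:
        new_score, pointers = [], []
        for j, state in enumerate(states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans(i, j))
            new_score.append(score[best_i] + log_trans(best_i, j) + log_emit(ratio, state))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    j = max(range(n_states), key=lambda i: score[i])
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [states[i] for i in path]


window_ratios = [1.0, 1.02, 0.98, 1.5, 1.48, 1.52, 1.0]
print(viterbi_copy_number(window_ratios))  # [2, 2, 2, 3, 3, 3, 2]
```

The transition penalty is what distinguishes this from simple rounding: a state change is only called when the coverage evidence outweighs the cost of switching.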

After this, a report can be generated. For example, the copy number variation may be reported as a graph, indicating various positions in the genome and a corresponding increase or decrease or maintenance of copy number variation at each respective position. Additionally, copy number variation may be used to report a percentage score indicating how much disease material (or nucleic acids having a copy number variation) exists in the cell free polynucleotide sample.

In one embodiment, the report includes annotations to help physicians. The annotating can include annotating a report for a condition in the NCCN Clinical Practice Guidelines in Oncology™ or the American Society of Clinical Oncology (ASCO) clinical practice guidelines. The annotating can include listing one or more FDA-approved drugs for off-label use, one or more drugs listed in a Centers for Medicare and Medicaid Services (CMS) anti-cancer treatment compendia, and/or one or more experimental drugs found in scientific literature, in the report. The annotating can include connecting a listed drug treatment option to a reference containing scientific information regarding the drug treatment option. The scientific information can be from a peer-reviewed article from a medical journal. The annotating can include using information provided by Ingenuity® Systems. The annotating can include providing a link to information on a clinical trial for a drug treatment option in the report. The annotating can include presenting information in a pop-up box or fly-over box near provided drug treatment options in an electronic based report. The annotating can include adding information to a report selected from the group consisting of one or more drug treatment options, scientific information concerning one or more drug treatment options, one or more links to scientific information regarding one or more drug treatment options, one or more links to citations for scientific information regarding one or more drug treatment options, and clinical trial information regarding one or more drug treatment options.

As depicted in FIG. 4, a comparison of sequence coverage to a control sample or reference sequence may aid in normalization across windows. In this embodiment, cell free DNAs are extracted and isolated from a readily accessible bodily fluid such as blood. For example, cell free DNAs can be extracted using a variety of methods recognized in the art, including but not limited to isopropanol precipitation and/or silica based purification. Cell free DNAs may be extracted from any number of subjects, such as subjects without cancer, subjects at risk for cancer, or subjects known to have cancer.

Following the isolation/extraction step, any of a number of different sequencing operations may be performed on the cell free polynucleotide sample. Samples may be processed before sequencing with one or more reagents (e.g., enzymes, unique identifiers (e.g., barcodes), probes, etc.). In some cases, if the sample is processed with a unique identifier such as a barcode, the samples or fragments of samples may be tagged individually or in subgroups with the unique identifier. The tagged sample may then be used in a downstream application such as a sequencing reaction by which individual molecules may be tracked to parent molecules.

Generally, as shown in FIG. 4, mutation detection may be performed on selectively enriched regions of the genome or transcriptome purified and isolated (302). As described herein, specific regions, which may include but are not limited to genes, oncogenes, tumor suppressor genes, promoters, regulatory sequence elements, non-coding regions, miRNAs, snRNAs and the like may be selectively amplified from a total population of cell free polynucleotides. This may be performed as herein described. In one example, multiplex sequencing may be used, with or without barcode labels for individual polynucleotide sequences. In other examples, sequencing may be performed using any nucleic acid sequencing platforms recognized in the art. This step generates a plurality of genomic fragment sequence reads (304). Additionally, a reference sequence is obtained from a control sample, taken from another subject. In some cases, the control subject may be a subject known to not have known genetic variations or disease. In some cases, these sequence reads may contain barcode information. In other examples, barcodes are not utilized. In yet other examples, non-unique sequence tags are used.

After sequencing, reads are assigned a quality score. A quality score may be a representation of reads that indicates whether those reads may be useful in subsequent analysis based on a threshold. In some cases, some reads are not of sufficient quality or length to perform the subsequent mapping step. In step 306, the genomic fragment reads that meet a specified quality score threshold are mapped to a reference genome, or a reference sequence that is known not to contain mutations. After mapping alignment, sequence reads are assigned a mapping score. A mapping score may be a representation of reads mapped back to the reference sequence indicating whether each position is or is not uniquely mappable. In some instances, reads may be sequences unrelated to mutation analysis. For example, some sequence reads may originate from contaminant polynucleotides. Sequencing reads with a mapping score of at least 90%, 95%, 99%, 99.9%, 99.99% or 99.999% may be filtered out of the data set. In other cases, sequencing reads assigned a mapping score less than 90%, 95%, 99%, 99.9%, 99.99% or 99.999% may be filtered out of the data set.

For each mappable base, bases that do not meet the minimum threshold for mappability, or low quality bases, may be replaced by the corresponding bases as found in the reference sequence.

Once read coverage has been ascertained and variant bases relative to the control sequence in each read are identified, the frequency of variant bases may be calculated as the number of reads containing the variant divided by the total number of reads (308). This may be expressed as a ratio for each mappable position in the genome.
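The ratio described for step 308 can be sketched per position. The pileup representation (a mapping from position to the list of bases observed in the reads covering it) and the function name `variant_frequencies` are assumptions made for illustration; the coordinates and bases are invented.

```python
def variant_frequencies(pileup, reference):
    """Fraction of reads at each position whose base differs from the reference.

    pileup: dict mapping position -> list of bases observed at that position.
    reference: dict mapping position -> reference (control) base.
    Positions with no covering reads are omitted from the result.
    """
    frequencies = {}
    for position, bases in pileup.items():
        if not bases:
            continue
        variant_reads = sum(1 for base in bases if base != reference[position])
        frequencies[position] = variant_reads / len(bases)
    return frequencies


pileup = {100: list("AAAAAAAAAT"), 101: list("CCCC"), 102: []}
reference = {100: "A", 101: "C", 102: "G"}
print(variant_frequencies(pileup, reference))  # {100: 0.1, 101: 0.0}
```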

For each base position, the frequencies of all four nucleotides, cytosine, guanine, thymine and adenine, are analyzed in comparison to the reference sequence (310). A stochastic or statistical modeling algorithm is applied to convert the normalized ratios for each mappable position to reflect frequency states for each base variant. In some cases, this algorithm may comprise one or more of the following: Hidden Markov Model, dynamic programming, support vector machine, Bayesian or probabilistic modeling, trellis decoding, Viterbi decoding, expectation maximization, Kalman filtering methodologies, and neural networks.

The discrete mutation states of each base position can be utilized to identify a base variant with high frequency of variance as compared to the baseline of the reference sequence. In some cases, the baseline might represent a frequency of at least 0.0001%, 0.001%, 0.01%, 0.1%, 1.0%, 2.0%, 3.0%, 4.0%, 5.0%, 10%, or 25%. In some cases, all adjacent base positions with the base variant or mutation can be merged into a segment to report the presence or absence of a mutation. In some cases, various positions can be filtered before they are merged with other segments.

After calculation of frequencies of variance for each base position, the variant with the largest deviation for a specific position in the sequence derived from the subject as compared to the reference sequence is identified as a mutation. In some cases, a mutation may be a cancer mutation. In other cases, a mutation might be correlated with a disease state.
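Selecting the variant with the largest deviation at a position can be sketched as follows. The sketch interprets "deviation" as the frequency of a non-reference base and applies an optional baseline below which no call is made; both the interpretation and the default baseline value are assumptions, and the function name `top_variant` is hypothetical.

```python
from collections import Counter


def top_variant(bases, reference_base, baseline=0.0):
    """Return (base, frequency) for the most frequent non-reference base,
    or None if no non-reference base exceeds the baseline frequency.

    bases: iterable of bases (A, C, G, T) observed at one position.
    """
    counts = Counter(bases)
    total = sum(counts.values())
    if total == 0:
        return None
    # Frequencies of the three non-reference nucleotides.
    candidates = {
        base: counts.get(base, 0) / total
        for base in "ACGT"
        if base != reference_base
    }
    best = max(candidates, key=candidates.get)
    if candidates[best] <= baseline:
        return None
    return best, candidates[best]


observed = "A" * 95 + "G" * 4 + "T"
print(top_variant(observed, "A"))                 # ('G', 0.04)
print(top_variant(observed, "A", baseline=0.05))  # None
```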

A mutation or variant may comprise a genetic aberration that includes, but is not limited to, a single base substitution, a transversion, a translocation, an inversion, a deletion, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, chromosome fusions, a gene truncation, a gene amplification, a gene duplication, a chromosomal lesion, a DNA lesion, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns and abnormal changes in nucleic acid methylation. In some cases, a mutation may be at most 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 or 20 nucleotides in length. In other cases, a mutation may be at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15 or 20 nucleotides in length.

Next, a consensus is determined using prior readings. This is done by reviewing the prior confidence scores for the corresponding bases, and if consistent prior confidence scores exist, then the current confidence score is increased (314). If there are prior confidence scores, but they are inconsistent, the current confidence score is not modified in one embodiment (316). In other embodiments, the confidence score is adjusted in a predetermined manner for inconsistent prior confidence scores. If this is the first time the family is detected, the current confidence score can be reduced as it may be a false reading (318). The process then converts the frequency of variance for each base into discrete variant states for each base position (320).

The presence or absence of a mutation may be reflected in graphical form, indicating various positions in the genome and a corresponding increase or decrease or maintenance of a frequency of mutation at each respective position. Additionally, mutations may be used to report a percentage score indicating how much disease material exists in the cell free polynucleotide sample. A confidence score may accompany each detected mutation, given known statistics of typical variances at reported positions in non-disease reference sequences. Mutations may also be ranked in order of abundance in the subject or ranked by clinically actionable importance.

Next, applications of the technology are detailed. One application is the detection of cancer. Numerous cancers may be detected using the methods and systems described herein. Cancer cells, as most cells, can be characterized by a rate of turnover, in which old cells die and are replaced by newer cells. Generally dead cells, in contact with vasculature in a given subject, may release DNA or fragments of DNA into the blood stream. This is also true of cancer cells during various stages of the disease. Cancer cells may also be characterized, dependent on the stage of the disease, by various genetic variations such as copy number variation as well as mutations. This phenomenon may be used to detect the presence or absence of cancers in individuals using the methods and systems described herein.

For example, blood from subjects at risk for cancer may be drawn and prepared as described herein to generate a population of cell free polynucleotides. In one example, this might be cell free DNA. The systems and methods of the disclosure may be employed to detect mutations or copy number variations that may exist in certain cancers present. The method may help detect the presence of cancerous cells in the body, despite the absence of symptoms or other hallmarks of disease.

The types and number of cancers that may be detected may include but are not limited to blood cancers, brain cancers, lung cancers, skin cancers, nose cancers, throat cancers, liver cancers, bone cancers, lymphomas, pancreatic cancers, bowel cancers, rectal cancers, thyroid cancers, bladder cancers, kidney cancers, mouth cancers, stomach cancers, solid state tumors, heterogeneous tumors, homogenous tumors and the like.

In the early detection of cancers, any of the systems or methods herein described, including mutation detection or copy number variation detection, may be utilized to detect cancers. These systems and methods may be used to detect any number of genetic variations that may cause or result from cancers. These may include but are not limited to mutations, indels, copy number variations, transversions, translocations, inversions, deletions, aneuploidy, partial aneuploidy, polyploidy, chromosomal instability, chromosomal structure alterations, gene fusions, chromosome fusions, gene truncations, gene amplifications, gene duplications, chromosomal lesions, DNA lesions, abnormal changes in nucleic acid chemical modifications, abnormal changes in epigenetic patterns, abnormal changes in nucleic acid methylation, infection and cancer.

Additionally, the systems and methods described herein may also be used to help characterize certain cancers. Genetic data produced from the systems and methods of this disclosure may allow practitioners to help better characterize a specific form of cancer. Cancers may be heterogeneous in both composition and staging. Genetic profile data may allow characterization of specific sub-types of cancer that may be important in the diagnosis or treatment of that specific sub-type. This information may also provide a subject or practitioner clues regarding the prognosis of a specific type of cancer.

The systems and methods provided herein may be used to monitor cancers, or other diseases in a particular subject. This may allow either a subject or practitioner to adapt treatment options in accord with the progress of the disease. In this example, the systems and methods described herein may be used to construct genetic profiles of a particular subject over the course of the disease. In some instances, cancers can progress, becoming more aggressive and genetically unstable. In other examples, cancers may remain benign, inactive or dormant. The systems and methods of this disclosure may be useful in determining disease progression.

Further, the systems and methods described herein may be useful in determining the efficacy of a particular treatment option. In one example, successful treatment options may actually increase the amount of copy number variation or mutations detected in a subject's blood, as more cancer cells may die and shed DNA. In other examples, this may not occur. In another example, perhaps certain treatment options may be correlated with genetic profiles of cancers over time. This correlation may be useful in selecting a therapy. Additionally, if a cancer is observed to be in remission after treatment, the systems and methods described herein may be useful in monitoring residual disease or recurrence of disease.

The methods and systems described herein may not be limited to detection of mutations and copy number variations associated with only cancers. Various other diseases and infections may result in other types of conditions that may be suitable for early detection and monitoring. For example, in certain cases, genetic disorders or infectious diseases may cause a certain genetic mosaicism within a subject. This genetic mosaicism may cause copy number variation and mutations that could be observed. In another example, the systems and methods of the disclosure may also be used to monitor the genomes of immune cells within the body. Immune cells, such as B cells, may undergo rapid clonal expansion upon the presence of certain diseases. Clonal expansions may be monitored using copy number variation detection and certain immune states may be monitored. In this example, copy number variation analysis may be performed over time to produce a profile of how a particular disease may be progressing.

Further, the systems and methods of this disclosure may also be used to monitor systemic infections themselves, as may be caused by a pathogen such as a bacterium or virus. Copy number variation or even mutation detection may be used to determine how a population of pathogens is changing during the course of infection. This may be particularly important during chronic infections, such as HIV/AIDS or Hepatitis infections, whereby viruses may change life cycle state and/or mutate into more virulent forms during the course of infection.

Yet another example that the systems and methods of this disclosure may be used for is the monitoring of transplant subjects. Generally, transplanted tissue undergoes a certain degree of rejection by the body upon transplantation. The methods of this disclosure may be used to determine or profile rejection activities of the host body, as immune cells attempt to destroy transplanted tissue. This may be useful in monitoring the status of transplanted tissue as well as altering the course of treatment or prevention of rejection.

Further, the methods of the disclosure may be used to characterize the heterogeneity of an abnormal condition in a subject, the method comprising generating a genetic profile of extracellular polynucleotides in the subject, wherein the genetic profile comprises a plurality of data resulting from copy number variation and mutation analyses. In some cases, including but not limited to cancer, a disease may be heterogeneous. Disease cells may not be identical. In the example of cancer, some tumors comprise different types of tumor cells, some cells in different stages of the cancer. In other examples, heterogeneity may comprise multiple foci of disease. Again, in the example of cancer, there may be multiple tumor foci, perhaps where one or more foci are the result of metastases that have spread from a primary site.

The methods of this disclosure may be used to generate a profile, fingerprint or set of data that is a summation of genetic information derived from different cells in a heterogeneous disease. This set of data may comprise copy number variation and mutation analyses alone or in combination.

Additionally, the systems and methods of the disclosure may be used to diagnose, prognose, monitor or observe cancers or other diseases of fetal origin. That is, these methodologies may be employed in a pregnant subject to diagnose, prognose, monitor or observe cancers or other diseases in an unborn subject whose DNA and other polynucleotides may co-circulate with maternal molecules.

Further, these reports are submitted and accessed electronically via the internet. Analysis of sequence data occurs at a site other than the location of the subject. The report is generated and transmitted to the subject's location. Via an internet-enabled computer, the subject accesses the reports reflecting his tumor burden.

The annotated information can be used by a health care provider to select other drug treatment options and/or provide information about drug treatment options to an insurance company. The method can include annotating the drug treatment options for a condition in, for example, the NCCN Clinical Practice Guidelines in Oncology™ or the American Society of Clinical Oncology (ASCO) clinical practice guidelines.

The drug treatment options that are stratified in a report can be annotated in the report by listing additional drug treatment options. An additional drug treatment can be an FDA-approved drug for an off-label use. A provision in the 1993 Omnibus Budget Reconciliation Act (OBRA) requires Medicare to cover off-label uses of anticancer drugs that are included in standard medical compendia. The drugs used for annotating lists can be found in CMS approved compendia, including the National Comprehensive Cancer Network (NCCN) Drugs and Biologics Compendium™, Thomson Micromedex DrugDex®, Elsevier Gold Standard's Clinical Pharmacology compendium, and American Hospital Formulary Service-Drug Information Compendium®.

The drug treatment options can be annotated by listing an experimental drug that may be useful in treating a cancer with one or more molecular markers of a particular status. The experimental drug can be a drug for which in vitro data, in vivo data, animal model data, pre-clinical trial data, or clinical-trial data are available. The data can be published in peer-reviewed medical literature found in journals listed in the CMS Medicare Benefit Policy Manual, including, for example, American Journal of Medicine, Annals of Internal Medicine, Annals of Oncology, Annals of Surgical Oncology, Biology of Blood and Marrow Transplantation, Blood, Bone Marrow Transplantation, British Journal of Cancer, British Journal of Hematology, British Medical Journal, Cancer, Clinical Cancer Research, Drugs, European Journal of Cancer (formerly the European Journal of Cancer and Clinical Oncology), Gynecologic Oncology, International Journal of Radiation, Oncology, Biology, and Physics, The Journal of the American Medical Association, Journal of Clinical Oncology, Journal of the National Cancer Institute, Journal of the National Comprehensive Cancer Network (NCCN), Journal of Urology, Lancet, Lancet Oncology, Leukemia, The New England Journal of Medicine, and Radiation Oncology.

The drug treatment options can be annotated by providing a link on an electronic-based report connecting a listed drug to scientific information regarding the drug. For example, a link can be provided to information regarding a clinical trial for a drug (clinicaltrials.gov). If the report is provided via a computer or computer website, the link can be a footnote, a hyperlink to a website, a pop-up box, or a fly-over box with information, etc. The report and the annotated information can be provided on a printed form, and the annotations can be, for example, a footnote to a reference.

The information for annotating one or more drug treatment options in a report can be provided by a commercial entity that stores scientific information, for example, Ingenuity® Systems. A health care provider can treat a subject, such as a cancer patient, with an experimental drug listed in the annotated information, and the health care provider can access the annotated drug treatment option, retrieve the scientific information (e.g., print a medical journal article) and submit it (e.g., a printed journal article) to an insurance company along with a request for reimbursement for providing the drug treatment. Physicians can use any of a variety of Diagnosis-related group (DRG) codes to enable reimbursement.

A drug treatment option in a report can also be annotated with information regarding other molecular components in a pathway that a drug affects (e.g., information on a drug that targets a kinase downstream of a cell-surface receptor that is a drug target). The drug treatment option can be annotated with information on drugs that target one or more other molecular pathway components. The identification and/or annotation of information related to pathways can be outsourced or subcontracted to another company.

The annotated information can be, for example, a drug name (e.g., an FDA approved drug for off-label use; a drug found in a CMS approved compendium, and/or a drug described in a scientific (medical) journal article), scientific information concerning one or more drug treatment options, one or more links to scientific information regarding one or more drugs, clinical trial information regarding one or more drugs (e.g., information from clinicaltrials.gov/), one or more links to citations for scientific information regarding drugs, etc.

The annotated information can be inserted into any location in a report. Annotated information can be inserted in multiple locations on a report. Annotated information can be inserted in a report near a section on stratified drug treatment options. Annotated information can be inserted into a report on a separate page from stratified drug treatment options. A report that does not contain stratified drug treatment options can be annotated with information.

The provided methods can also be utilized for investigating the effects of drugs on a sample (e.g., tumor cells) isolated from a subject (e.g., a cancer patient). An in vitro culture using a tumor from a cancer patient can be established using techniques recognized by those skilled in the art.

The provided method can also include high-throughput screening of FDA approved off-label drugs or experimental drugs using the in vitro culture and/or xenograft model.

The provided method can also include monitoring tumor antigen for recurrence detection.

Reports may be generated, mapping genome positions and copy number variation for the subject with cancer, as shown in FIGS. 5A and 5B. These reports, in comparison to other profiles of subjects with known outcomes, can indicate that a particular cancer is aggressive and resistant to treatment. The subject is monitored for a period and retested. If, at the end of the period, the copy number variation profile begins to increase dramatically, this may indicate that the current treatment is not working. A comparison is done with genetic profiles of other prostate cancer subjects. For example, if it is determined that this increase in copy number variation indicates that the cancer is advancing, then the original treatment regimen as prescribed is no longer treating the cancer and a new treatment is prescribed.
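The retest comparison described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names, the placeholder locus names, the "burden" measure (summed absolute deviation from an assumed diploid copy number of 2) and the 1.5-fold cutoff are all assumptions chosen only for illustration.

```python
from typing import Dict


def cnv_burden(profile: Dict[str, float], baseline: float = 2.0) -> float:
    """Hypothetical summary metric: total absolute deviation from the baseline copy number."""
    return sum(abs(copy_number - baseline) for copy_number in profile.values())


def treatment_flag(earlier: Dict[str, float], later: Dict[str, float],
                   fold_threshold: float = 1.5) -> str:
    """Compare copy number profiles from two time points.

    Returns "review" when the burden rose by at least fold_threshold
    (an assumed cutoff), otherwise "stable".
    """
    before, after = cnv_burden(earlier), cnv_burden(later)
    if before == 0:
        return "review" if after > 0 else "stable"
    return "review" if after / before >= fold_threshold else "stable"


# Placeholder loci and copy numbers, invented for the example.
first_test = {"locus_A": 2.4, "locus_B": 2.2, "locus_C": 2.0}
retest = {"locus_A": 3.6, "locus_B": 2.9, "locus_C": 2.1}
print(treatment_flag(first_test, retest))
```

In practice any such flag would be interpreted alongside the profiles of subjects with known outcomes, as the paragraph above describes, rather than against a fixed cutoff.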

In an embodiment, the system supports the gene panel shown in FIG. 9. The gene panel of FIG. 9 may be used with methods and systems of the present disclosure.

These reports may be submitted and accessed electronically via the internet. Analysis of sequence data occurs at a site other than the location of the subject. The report is generated and transmitted to the subject's location. Via an internet-enabled computer, the subject accesses the reports reflecting his tumor burden (FIGS. 5A and 5B).

FIG. 6 is a schematic representation of internet-enabled access of reports of a subject with cancer. The system of FIG. 6 can use a handheld DNA sequencer or a desktop DNA sequencer. The DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The order of the DNA bases is reported as a text string, called a read. Some DNA sequencers can also be considered optical instruments, as they analyze light signals originating from fluorochromes attached to nucleotides.

The DNA sequencer can apply Gilbert's sequencing method based on chemical modification of DNA followed by cleavage at specific bases, or it can apply Sanger's technique, which is based on dideoxynucleotide chain termination. The Sanger method became popular due to its increased efficiency and low radioactivity. The DNA sequencer can use techniques that do not require DNA amplification (polymerase chain reaction, PCR), which speeds up the sample preparation before sequencing and reduces errors. In addition, sequencing data is collected from the reactions caused by the addition of nucleotides in the complementary strand in real time. For example, the DNA sequencers can utilize a method called single-molecule real-time (SMRT) sequencing, where sequencing data is produced by light (captured by a camera) emitted when a nucleotide is added to the complementary strand by enzymes containing fluorescent dyes. Alternatively, the DNA sequencers can use electronic systems based on nanopore sensing technologies.

The data is sent by the DNA sequencers over a direct connection or over the internet to a computer for processing. The data processing aspects of the system can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Data processing apparatus of the invention can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and data processing method steps of the invention can be performed by a programmable processor executing a program of instructions to perform functions of the invention by operating on input data and generating output. The data processing aspects of the invention can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from and to transmit data and instructions to a data storage system, at least one input device, and at least one output device. Each computer program can be implemented in a high-level procedural or object-oriented programming language, or in assembly or machine language, if desired; and, in any case, the language can be a compiled or interpreted language. Suitable processors include, by way of example, both general and special purpose microprocessors. Generally, a processor will receive instructions and data from a read-only memory and/or a random access memory. Storage devices suitable for tangibly embodying computer program instructions and data include all forms of nonvolatile memory, including by way of example semiconductor memory devices, such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM disks. Any of the foregoing can be supplemented by, or incorporated in, ASICs (application-specific integrated circuits).

To provide for interaction with a user, the invention can be implemented using a computer system having a display device such as a monitor or LCD (liquid crystal display) screen for displaying information to the user and input devices by which the user can provide input to the computer system such as a keyboard, a two-dimensional pointing device such as a mouse or a trackball, or a three-dimensional pointing device such as a data glove or a gyroscopic mouse. The computer system can be programmed to provide a graphical user interface through which computer programs interact with users. The computer system can be programmed to provide a virtual reality, three-dimensional display interface.

Computer control systems

The present disclosure provides computer control systems that are programmed to implement methods of the disclosure. FIG. 7 shows a computer system 701 that is programmed or otherwise configured to analyze genetic data. The methods described herein for detecting genetic variations below a detection limit may provide for more efficient processing of genetic data, thereby improving the functioning of a computer system. For example, the computer system may be able to process genetic data and identify a genetic variant more quickly or efficiently (e.g., no re-processing of the genetic data or processing of additional genetic data may be necessary if the computer system may identify the genetic variant below the detection limit).

The computer system 701 can regulate various aspects of detecting genetic variations below a noise range or detection limit of the present disclosure, such as, for example, detecting genetic variations in nucleic acid molecules, comparing sets of genetic variations, determining diagnostic confidence indications, determining confidence intervals, sequencing nucleic acids, including massively parallel sequencing, grouping sequence reads into families, collapsing grouped sequence reads, and determining consensus sequences. The computer system 701 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.

The computer system 701 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 705, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 701 also includes memory or memory location 710 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 715 (e.g., hard disk), communication interface 720 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 725, such as cache, other memory, data storage and/or electronic display adapters. The memory 710, storage unit 715, interface 720 and peripheral devices 725 are in communication with the CPU 705 through a communication bus (solid lines), such as a motherboard. The storage unit 715 can be a data storage unit (or data repository) for storing data. The computer system 701 can be operatively coupled to a computer network (“network”) 730 with the aid of the communication interface 720. The network 730 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 730 in some cases is a telecommunication and/or data network. The network 730 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 730, in some cases with the aid of the computer system 701, can implement a peer-to-peer network, which may enable devices coupled to the computer system 701 to behave as a client or a server.

The CPU 705 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 710. The instructions can be directed to the CPU 705, which can subsequently program or otherwise configure the CPU 705 to implement methods of the present disclosure. Examples of operations performed by the CPU 705 can include fetch, decode, execute, and writeback.

The CPU 705 can be part of a circuit, such as an integrated circuit. One or more other components of the system 701 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).

The storage unit 715 can store files, such as drivers, libraries and saved programs. The storage unit 715 can store user data, e.g., user preferences and user programs. The computer system 701 in some cases can include one or more additional data storage units that are external to the computer system 701, such as located on a remote server that is in communication with the computer system 701 through an intranet or the Internet.

The computer system 701 can communicate with one or more remote computer systems through the network 730. For instance, the computer system 701 can communicate with a remote computer system of a user (e.g., a physician, a laboratory technician, a genetic counselor, a scientist, among others). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smartphones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 701 via the network 730.

Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 701, such as, for example, on the memory 710 or electronic storage unit 715. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 705. In some cases, the code can be retrieved from the storage unit 715 and stored on the memory 710 for ready access by the processor 705. In some situations, the electronic storage unit 715 can be precluded, and machine-executable instructions are stored on memory 710.

The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code, or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.

Aspects of the systems and methods provided herein, such as the computer system 701, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.

Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.

The computer system 701 can include or be in communication with an electronic display 735 that comprises a user interface (UI) 740 for providing, for example, personal or individualized patient reports identifying genomic variations or alterations, which may include tumor specific genomic alterations and associated treatment options. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface. Data generated and displayed using a user interface (740) may be accessed by a user, such as a healthcare professional, laboratory technician, genetic counselor, or a scientist, on the network.

Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 705. The algorithm can, for example, sequence nucleic acids (e.g., massively parallel sequencing), group nucleic acid sequences, collapse grouped nucleic acid sequences, generate consensus sequences, detect genetic variations, update diagnostic confidence intervals, annotate sequences, generate reports, and execute other processes which may comprise one or more of the following: Hidden Markov Model, dynamic programming, Bayesian network, trellis decoding, Viterbi decoding, expectation maximization, Kalman filtering methodologies and neural networks.
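As one illustration of the grouping, collapsing and consensus steps listed above, the following minimal sketch groups reads into families and collapses each family by majority vote. The read layout (molecular tag, start position, sequence), the use of the tag and start position as the family key, and the per-position majority vote over equal-length reads are assumptions made for the example; the disclosure does not specify them here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed read layout: (molecular tag, start position, sequence).
Read = Tuple[str, int, str]


def group_into_families(reads: Iterable[Read]) -> Dict[Tuple[str, int], List[str]]:
    """Group reads that share the same tag and start position into one family."""
    families: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for tag, start, sequence in reads:
        families[(tag, start)].append(sequence)
    return dict(families)


def consensus(sequences: List[str]) -> str:
    """Collapse a family by majority vote at each position (equal-length reads assumed)."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*sequences))


# Invented reads: the third read carries a single discordant base.
reads = [
    ("AAT", 100, "ACGTAC"),
    ("AAT", 100, "ACGTAC"),
    ("AAT", 100, "ACGAAC"),
    ("GGC", 250, "TTGCAA"),
]
collapsed = {key: consensus(seqs) for key, seqs in group_into_families(reads).items()}
print(collapsed)
```

A base that appears in only a minority of a family's reads is outvoted, which is one simple way a collapsing step can suppress isolated sequencing errors.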

The following examples are offered by way of illustration and not by way of limitation.

EXAMPLES

FIG. 8 shows a graph of frequency of detected base changes (compared to a reference genome) in a DNA sample along 70 kb of sequence of a plurality of oncogenes amplified and sequenced using protocols appropriate for Illumina sequencing. The sample was spiked with a low percentage of control DNA carrying sequence variants at known locations. These variants are represented by dark circles. Variants occurring at log 0 (100%) or log −0.3 (0.5 or 50%) represent homozygous or heterozygous loci, respectively. Variants at less than log −2 (less than 1%) occur in the noise range of this system, and may represent sequencing errors (noise) or actual variants (information). For any variant detected in the noise range, it may not be possible to determine whether the variant represents noise or information. Amid the “noise”, one has diminished confidence that base calls at the mutant positions represent information (actual mutants) rather than noise. However, if the control DNA is spiked into a second sample, it should appear again at a similar frequency. In contrast, the probability that an error is detected at the same locus again is a function of the error rate, and such a repeat is less likely to be seen. The independent detection of the same variant increases the probability that information, rather than noise, is being detected, and provides increased confidence that a diagnosis of cancer is a correct one.
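The log-scale reading of FIG. 8 can be made concrete with a short sketch. The reference points (log 0 for homozygous, log −0.3 for heterozygous, below log −2 for the noise range) come from the description above; the function name, the tolerance of 0.1 log units around each reference point, and the "intermediate" label are assumptions added for illustration.

```python
import math


def classify_variant(frequency: float, tolerance: float = 0.1) -> str:
    """Label a variant by the base-10 log of its observed frequency.

    tolerance is an assumed window (in log units) around the homozygous
    and heterozygous reference points.
    """
    if not 0 < frequency <= 1:
        raise ValueError("frequency must be in (0, 1]")
    log_frequency = math.log10(frequency)
    if abs(log_frequency) <= tolerance:
        return "homozygous"            # near log 0, i.e. about 100%
    if abs(log_frequency - math.log10(0.5)) <= tolerance:
        return "heterozygous"          # near log -0.3, i.e. about 50%
    if log_frequency < -2:
        return "noise range"           # below 1%: noise or information
    return "intermediate"              # label invented for this sketch


for observed in (1.0, 0.5, 0.05, 0.005):
    print(observed, classify_variant(observed))
```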

To the extent a sequencing error is the result of chance, the probability of detecting the same sequencing error multiple times can be exponentially smaller than detecting it a single time. Thus, if a particular signal is detected multiple times, it is more probably information rather than noise. This characteristic can be used to increase the probability that a genetic variant detected at low level represents an actual polynucleotide or set of polynucleotides, rather than a sequencing artifact.
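A small worked calculation shows the exponential drop. It assumes that errors at a given locus arise independently in each test with the same per-test probability, so the chance of seeing the same error in every one of n tests is that probability raised to the power n. The error rate used below is an arbitrary illustrative value, not a figure from this disclosure.

```python
def repeat_error_probability(error_rate: float, detections: int) -> float:
    """Probability that the same chance error appears in every one of
    `detections` independent tests (independence is an assumption)."""
    if not 0 <= error_rate <= 1:
        raise ValueError("error_rate must be between 0 and 1")
    if detections < 1:
        raise ValueError("detections must be at least 1")
    return error_rate ** detections


# Illustrative per-test error rate at one locus (assumed value).
for n in (1, 2, 3):
    print(n, repeat_error_probability(0.001, n))
```

Under these assumptions each additional concordant detection multiplies the chance-error probability by the error rate again, which is why repeated detection favors information over noise.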

In one example, a signal indicating a pathology is detected in a plurality of instances. In certain embodiments, the signal is a polynucleotide bearing a somatic mutation associated with cancer or a copy number variation associated with cancer. Repeated detection of the signal increases the probability that the signal represents information rather than noise. The repeated instances include, without limitation, (1) repeated testing of the same sample, (2) testing of two samples taken at the same time from a subject or (3) testing of two samples taken at different times from a subject. Determining increased probability is particularly useful when the first detected signal is at a level that cannot be reliably differentiated from noise. The methods of this disclosure find use, among other things, in monitoring a subject over time for early detection of pathology, for example, when repeated testing detects pathology at levels which, in a single test, are too low to reliably make a diagnosis of pathology.
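One way to picture how repeated detection raises confidence is a simple Bayesian update, sketched below. This is an illustrative model only, not the method of the disclosure: the prior, the sensitivity (chance a true variant is detected in a test) and the false positive rate (chance noise produces the same call) are assumed values, and the tests are treated as independent.

```python
def update_confidence(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """Posterior probability that a signal is real after one more detection (Bayes' rule)."""
    true_path = prior * sensitivity
    noise_path = (1.0 - prior) * false_positive_rate
    return true_path / (true_path + noise_path)


# Assumed starting confidence and test characteristics, for illustration only.
confidence = 0.01
history = []
for _ in range(3):  # three instances in which the same signal is detected
    confidence = update_confidence(confidence, sensitivity=0.5, false_positive_rate=0.01)
    history.append(confidence)
print(history)
```

With these assumed numbers a signal that is weak evidence after one detection becomes strong evidence after two or three, which matches the qualitative behavior described above for the three kinds of repeated instances.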

In another example describing co-variate variants associated with lung cancer, a signal associated with a detected high confidence variation is detected below the detection limit. If an EGFR L858R activating mutation is detected, the detection threshold for a co-variate resistance mutation, EGFR T790M, is relaxed. The independent detection of the activating or driver mutation increases confidence that a co-variate variant within the detection threshold is also detected.
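The threshold relaxation in this example can be sketched as follows. The EGFR L858R and EGFR T790M pairing is taken from the text; the function name, the lookup table structure, and the default (1%) and relaxed (0.1%) frequency thresholds are assumptions chosen for illustration.

```python
from typing import Dict, Set

# Co-variate pair from the text: resistance mutation -> driver mutation.
COVARIATES: Dict[str, str] = {"EGFR T790M": "EGFR L858R"}


def call_variants(frequencies: Dict[str, float],
                  default_threshold: float = 0.01,
                  relaxed_threshold: float = 0.001) -> Set[str]:
    """Call variants, relaxing the threshold for a co-variate of a confidently called driver.

    Both threshold values are assumed, not taken from the disclosure.
    """
    called = {name for name, freq in frequencies.items() if freq >= default_threshold}
    for variant, driver in COVARIATES.items():
        if driver in called and frequencies.get(variant, 0.0) >= relaxed_threshold:
            called.add(variant)
    return called


# T790M at 0.2% is only called because L858R is confidently detected.
print(call_variants({"EGFR L858R": 0.05, "EGFR T790M": 0.002}))
print(call_variants({"EGFR T790M": 0.002}))
```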

Methods and systems of the present disclosure may be combined with other methods and systems, such as, for example, those described in Patent Cooperation Treaty (PCT) Patent Publication Nos. WO/2014/039556, WO/2014/149134, WO/2015/100427 and WO/2015/175705, each of which is entirely incorporated herein by reference.

While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

1. A method for analyzing a disease state of a subject, comprising: (a) using a genetic analyzer to generate genetic data from nucleic acid molecules in biological samples of the subject obtained at (i) two or more time points or (ii) substantially the same time point, wherein the genetic data relates to genetic information of the subject, and wherein the biological samples include a cell-free biological sample; (b) receiving the genetic data from the genetic analyzer; (c) with one or more programmed computer processors, using the genetic data to produce an adjusted test result in a characterization of the genetic information of the subject; and (d) outputting the adjusted test result into computer memory.
2. The method of claim 1, wherein the genetic data comprises current sequence reads and prior sequence reads, and wherein (c) comprises comparing the current sequence reads with the prior sequence reads and updating a diagnostic confidence indication accordingly with respect to the characterization of the genetic information of the subject, which diagnostic confidence indication is indicative of a probability of identifying one or more genetic variations in a biological sample of the subject.
3. The method of claim 2, further comprising generating a confidence interval for the current sequence reads.
4. The method of claim 3, further comprising comparing the confidence interval with one or more prior confidence intervals and determining a disease progression based on overlapping confidence intervals.
5. The method of claim 1, wherein the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises increasing a diagnostic confidence indication in a subsequent or a previous characterization if the information from the first time point corroborates information from the second time point.
6. The method of claim 1, wherein the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises increasing a diagnostic confidence indication in the subsequent characterization if the information from the first time point corroborates information from the second time point.
7. The method of claim 1, wherein a first co-variate variation is detected in the genetic data, and wherein (c) comprises increasing a diagnostic confidence indication in the subsequent characterization if a second co-variate variation is detected.
8. The method of claim 1, wherein the biological samples are obtained at two or more time points including a first time point and a second time point, and wherein (c) comprises decreasing a diagnostic confidence indication in the subsequent characterization if the information from a first time point conflicts with information from the second time point.
9. The method of claim 1, further comprising obtaining a subsequent characterization and leaving as is a diagnostic confidence indication in the subsequent characterization for de novo information.
10. The method of claim 1, further comprising determining a frequency of one or more genetic variants detected in a collection of sequence reads included in the genetic data and producing the adjusted test result at least in part by comparing the frequency of the one or more genetic variants at the two or more time points.
11. The method of claim 1, further comprising determining an amount of copy number variation at one or more genetic loci detected in a collection of sequence reads included in the genetic data and producing the adjusted test result at least in part by comparing the amount at the two or more time points.
12. The method of claim 1, further comprising using the adjusted test result to provide (i) a therapeutic intervention or (ii) a diagnosis of a health or disease to the subject.
13. The method of claim 1, wherein the genetic data comprises sequence data from portions of a genome comprising disease-associated or cancer associated genetic variants.
14. The method of claim 1, further comprising using the adjusted test result to increase a sensitivity of detecting genetic variants by increasing read depth of polynucleotides in a sample from the subject.
15. The method of claim 1, wherein the genetic data comprises a first set of genetic data and a second set of genetic data, wherein the first set of genetic data is at or below a detection threshold and the second set of genetic data is above the detection threshold.
16. The method of claim 15, wherein the detection threshold is a noise threshold.
17. The method of claim 15, further comprising, in (c), adjusting a diagnosis of the subject from negative or uncertain to positive when the same genetic variants are detected in the first set of genetic data and the second set of genetic data in a plurality of sampling instances or time points.
18. The method of claim 15, further comprising, in (c), adjusting a diagnosis of the subject from negative or uncertain to positive in a characterization from an earlier time point when the same genetic variants are detected in the first set of genetic data at an earlier time point and in the second set of genetic data at a later time point.
19. The method of claim 1, wherein the disease state is cancer and the genetic analyzer is a nucleic acid sequencer.
20. The method of claim 1, wherein the biological samples include at least two different types of biological samples.
21. The method of claim 1, wherein the biological samples include the same type of biological sample.
22. The method of claim 21, wherein the biological samples are blood samples.
23. The method of claim 22, wherein the nucleic acid molecules are cell-free deoxyribonucleic acid (DNA).
24.-57. (canceled)